Data science project for ANLY-501 | A data science tale about a coffee company.

This notebook is the project report for the data science project done as part of ANLY-501. The project intends to show all stages of the data science pipeline; this notebook facilitates that by organizing the different phases into different sections and keeping the code and visualizations in one place. To run this notebook, install the Jupyter notebook software available from http://jupyter.org/. All code for this project is available on Github at https://github.com/aarora79/sb_study and there is also a simple website associated with it at https://aarora79.github.io/sb_study/. The entire code (including the generated output) can be downloaded as a zip file from Github via this URL: https://github.com/aarora79/sb_study/archive/master.zip. To run the code, simply unzip master.zip and run "python main.py". The code requires Python 3.5.2|Anaconda 4.1.1 to be installed on the machine. Also, since the code fetches the datasets in real time, an Internet connection is required.

[Updated for Project3 11/30/2016, starting from section Predictions based on classification].

Data Science Problem

Starbucks Corporation is an American coffee company and coffeehouse chain with more than 24,000 stores across the world. This project intends to explore the following data science problems:

  1. Exploratory Data Analysis (EDA) of Starbucks store locations, for example the geographical distribution of stores by country, region, ownership model, brand name etc.
  2. Find relationships between the Starbucks data and various economic and human development indices such as GDP, ease of doing business, rural to urban population ratio, literacy rate, revenue from tourist inflow and so on.
  3. Predict which countries where Starbucks does not have a store today are most suitable for Starbucks stores (in other words, in which country where Starbucks does not have a presence should Starbucks open its next store, and how many).

This problem is important because it attempts to provide a model (using Starbucks as an example) which can be applied to any similar business (say in the food and hospitality industry) to predict where it could expand globally. It also provides insight into where the business is located today, for example the fact that the total number of stores in the entire continent of Africa (48) is far less than the number of stores at U.S. airports alone (184).

Potential Analysis that Can Be Conducted Using Collected Data

The data used as part of this project is obtained from two sources.

  1. Starbucks store location data is available from https://opendata.socrata.com via an API. The Socrata Open Data API (SODA) end point for this data is https://opendata.socrata.com/resource/xy4y-c4mk.json

  2. The economic and urban development data for various countries is available from the World Bank (WB) website. WB APIs are described at https://datahelpdesk.worldbank.org/knowledgebase/topics/125589-developer-information. The API end point for all the World Development Indicators (WDI) is http://api.worldbank.org/indicators?format=json. Some examples of indicators being collected include GDP, urban to rural population ratio, % of employed youth (15-24 years), international tourism receipts (as % of total exports), ease of doing business and so on.
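The project's actual download logic lives in the sb and wb modules; the snippet below is only a minimal sketch, using the standard requests library, of how these two endpoints could be queried. The $limit and per_page paging parameters and the NY.GDP.MKTP.CD indicator code are illustrative choices, not necessarily what the project uses.

import requests

# Minimal sketch of querying the two endpoints; the project's real download code is in sb/sb.py and wb/wb.py.
SODA_URL = 'https://opendata.socrata.com/resource/xy4y-c4mk.json'
stores = requests.get(SODA_URL, params={'$limit': 1000}).json()  # SODA returns a list of JSON records
print('fetched %d Starbucks store records' % len(stores))

# The World Bank API returns a [metadata, data] pair for an indicator query; NY.GDP.MKTP.CD
# (GDP, current US$) is used here purely as an example indicator.
WB_URL = 'http://api.worldbank.org/countries/all/indicators/NY.GDP.MKTP.CD'
meta, gdp = requests.get(WB_URL, params={'format': 'json', 'date': '2015', 'per_page': 500}).json()
print('fetched GDP values for %d country entries' % len(gdp))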

The possible directions/hypotheses based on the collected data are listed below (this is not a complete list; it will be expanded as the project goes on):

  1. EDA about Starbucks store locations (for example):

    • What percentage of stores exist in high income, high literacy, high urban to rural population ratio European countries vs., say, high population, rising GDP, low urban to rural population ratio Asian countries.
    • Distribution of stores across geographies based on type of ownership (franchisee, joint venture etc.), brand name etc.
    • Which country and which city have the most Starbucks stores per 1,000 people.
    • Is there always a Starbucks open at any UTC time during a 24-hour period, i.e. can you always find some Starbucks store open at any given time somewhere in some timezone around the world?
  2. Data visualization of the Starbucks store data:

    • World map showing Starbucks locations around the world.
    • Heat map of the world based on the number of Starbucks stores in a country.
    • Frequency distribution of Starbucks stores by city in a given country. Does this distribution resemble any well-known statistical distribution?
    • Parallel coordinates based visualization for the number of stores combined with economic and urban development indicators.
  3. Machine learning model for predicting the number of Starbucks stores based on various economic and urban development indicators. This could then be used to predict which countries where Starbucks does not have a presence today would be best suited as new markets. For example, model the number of Starbucks locations in a country based on a) ease of doing business, b) GDP, c) urban to rural population ratio, d) employment rate for people in the 15-24 year age group, e) type of government, f) access to 24-hour electricity and so on.

Data Issues

The data used for this project is obtained via APIs from the Socrata website and the World Bank website and is therefore expected to be relatively error free (for example, as compared to the same data being obtained by scraping these websites). Even so, the data is checked for quality, and appropriate error handling or even alternate mechanisms are put in place to handle errors/issues with the data.

Issues and handling strategies:
Issue: Some of the city names (for example, cities in China) include UTF-16 characters and would therefore not display correctly in documents and charts. Handling: Replace the city name with the country code followed by _1, _2 and so on, for example CN_1, CN_2 etc.
Issue: Missing data in any of the fields in the Starbucks dataset. Handling: Ignore the data for the location if any mandatory field is missing (say the country name is missing). Keep a count of the locations ignored due to missing data to get a sense of the overall quality of the data.
Issue: Incorrect format of the value in various fields in the Starbucks dataset, for example latitude/longitude values being out of range, country codes being invalid etc. Handling: Ignore the data for the location with any invalid value. Keep a count of the locations ignored due to invalid data to get a sense of the overall quality of the data.
Issue: Misc. data type related errors, such as the date time field (first_seen field in the Starbucks dataset) not being correct, or fields considered primary keys not being unique (for example store id for the Starbucks dataset, country code for the WB dataset). Handling: Flag all invalid fields as errors.
Issue: Missing data for any of the indicators in the WB dataset. Handling: The most recent year for which the data is available is 2015; if for a particular indicator the 2015 data is not available then use data for the previous year i.e. 2014. If no data is available for that indicator even for the previous 5 years then flag it as such and have the user define some custom value for it.
Issue: Incorrect format of the value in various fields in the WB dataset, for example alphanumeric or non-numeric data for fields such as GDP for which numeric values are expected. Handling: Provide sufficient information to the user (in this case the programmer) about the incorrect data and have the user define correct values.
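As an illustration of the first few strategies, here is a minimal pandas sketch (not the project's actual code, which lives in the sb and wb submodules) that drops rows with missing mandatory fields or out-of-range coordinates while keeping a count of what was ignored; the mandatory column list is an assumption based on the Starbucks columns shown later in this notebook.

import pandas as pd

def apply_basic_handling(df, mandatory=('store_id', 'country', 'latitude', 'longitude')):
    """Sketch of the missing/invalid data strategies described above."""
    mandatory = [c for c in mandatory if c in df.columns]

    # Ignore locations with any mandatory field missing, but keep a count of them.
    missing_mask = df[mandatory].isnull().any(axis=1)
    print('ignoring %d rows with missing mandatory fields' % missing_mask.sum())
    df = df[~missing_mask]

    # Ignore locations with out-of-range latitude/longitude values, again keeping a count.
    invalid_mask = (df['latitude'].abs() > 90) | (df['longitude'].abs() > 180)
    print('ignoring %d rows with out-of-range coordinates' % invalid_mask.sum())
    return df[~invalid_mask]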

The subsequent sections of this notebook provide a view of the data for the two datasets used in this project and also provide the data quality scores (DQS) for both the datasets.

Datasets (Starbucks and WorldBank WDI)

The Python code for the SB Study (SBS) is run offline and the results (along with the code) are periodically pushed to Github. The code here simply downloads the files from Github to show the results and what the data looks like.


In [9]:
import pandas as pd
#get the Starbucks dataset from Github
#note that the dataset is the CSV generated offline by the SBS program and pushed to the repo's output folder
SB_DATASET_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/SB_data_as_downloaded.csv'
df = pd.read_csv(SB_DATASET_URL)

print('Starbucks dataset has %d rows and %d columns ' %(df.shape[0], df.shape[1]))
df.head()


Starbucks dataset has 24823 rows and 21 columns 
Out[9]:
brand city coordinates country country_subdivision current_timezone_offset first_seen latitude longitude name ... ownership_type phone_number postal_code store_id store_number street_1 street_2 street_3 street_combined timezone
0 Starbucks Hong Kong {u'latitude': u'22.3407001495361', u'needs_rec... CN 91 480 2013-12-08T22:41:59 22.340700 114.201691 Plaza Hollywood ... LS {u'phone_number': u'29554570'} NaN 1 34638-85784 Level 2, Plaza Hollywood, Diamond Hill, Kowloon NaN Level 2, Plaza Hollywood, Diamond Hill,, Kowloon China Standard Time
1 Starbucks Hong Kong {u'latitude': u'22.2839393615723', u'needs_rec... CN 91 480 2013-12-08T22:41:59 22.283939 114.158188 Exchange Square ... LS {u'phone_number': u'21473739'} NaN 6 34601-20281 Shops 308-310, 3/F., Exchange Square Podium, Central, HK. NaN Shops 308-310, 3/F.,, Exchange Square Podium, ... China Standard Time
2 Starbucks Kowloon {u'latitude': u'22.3228702545166', u'needs_rec... CN 91 480 2013-12-08T22:41:59 22.322870 114.213440 Telford Plaza ... LS {u'phone_number': u'27541323'} NaN 8 34610-28207 Shop Unit G1A, Atrium A, Telford Plaza I , Kowloon Bay, Kowloon NaN Shop Unit G1A, Atrium A, Telford Plaza I, , Ko... China Standard Time
3 Starbucks Hong Kong {u'latitude': u'22.2844505310059', u'needs_rec... CN 91 480 2013-12-08T22:41:59 22.284451 114.158463 Hong Kong Station ... LS {u'phone_number': u'25375216'} NaN 13 34622-64463 Concession HOK 3a & b LAR Hong Kong Station NaN Concession HOK 3a & b, LAR Hong Kong Station China Standard Time
4 Starbucks Hong Kong {u'latitude': u'22.2777309417725', u'needs_rec... CN 91 480 2013-12-08T22:41:59 22.277731 114.164917 Pacific Place, Central ... LS {u'phone_number': u'29184762'} NaN 17 34609-22927 Shop 131, Level 1, Pacific Place 88 Queensway, HK NaN Shop 131, Level 1, Pacific Place, 88 Queensway... China Standard Time

5 rows × 21 columns


In [10]:
import pandas as pd
#get the WorldBank WDI dataset from Github
#note that the dataset is the CSV generated offline by the SBS program and pushed to the repo's output folder
WB_DATASET_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/WDI_data_as_downloaded.csv'
df = pd.read_csv(WB_DATASET_URL)

print('Worldbank dataset has %d rows and %d columns ' %(df.shape[0], df.shape[1]))
df.head()


Worldbank dataset has 264 rows and 84 columns 
Out[10]:
country_code IC.TAX.LABR.CP.ZS WP_time_01.1 SP.POP.1564.TO.ZS IC.BUS.NDNS.ZS IC.LGL.CRED.XQ IC.GOV.DURS.ZS DT.DOD.PVLX.GN.ZS SE.ADT.LITR.ZS IC.EXP.COST.CD ... SL.SRV.EMPL.ZS FI.RES.TOTL.DT.ZS IC.FRM.BRIB.ZS IC.TAX.OTHR.CP.ZS IC.REG.COST.PC.ZS IC.ELC.OUTG SL.EMP.WORK.ZS TX.VAL.OTHR.ZS.WT EN.URB.LCTY.UR.ZS SI.POV.2DAY
0 BD NaN NaN 65.579469 NaN 6.0 NaN NaN NaN NaN ... NaN NaN NaN NaN 13.9 NaN NaN 62.451180 31.889816 NaN
1 BE 49.4 NaN 64.830742 NaN 4.0 NaN NaN NaN NaN ... NaN NaN NaN 0.6 4.8 NaN NaN 59.897278 18.516810 NaN
2 BF 21.4 NaN 52.034628 NaN 6.0 NaN NaN NaN NaN ... NaN NaN NaN 3.7 43.5 NaN NaN NaN 50.703959 NaN
3 BG 20.2 NaN 65.828506 NaN 9.0 NaN NaN NaN NaN ... NaN NaN NaN 1.8 0.7 NaN NaN NaN 23.100215 NaN
4 VE 18.0 NaN 65.627593 NaN 1.0 NaN NaN NaN NaN ... NaN NaN NaN 37.1 88.7 NaN NaN 14.755198 10.534170 NaN

5 rows × 84 columns

Data Quality Scores (DQS)

To check data quality, two types of checks were done on both datasets.

  1. Missing data: any cell in the dataset which was empty was counted as missing data. This is checked for all features in both datasets.
  2. Invalid data: validation checks were done on many fields (9 fields in the Starbucks dataset, and all 83 fields in the Worldbank dataset). These checks were both generic (validation for numeric data, absence of special characters, validation of timestamp data, latitude and longitude validation, timezone offset and so on) as well as context specific (store-id has to be unique).

Two metrics are derived to quantify missing data and invalid data: a raw score and an adjusted score.

  1. Raw Score for Missing Data: the percentage of data that is not missing, i.e. the count of non-empty cells vs. the total number of cells, expressed as a percentage. The higher this score, the cleaner the dataset from the perspective of missing data. This score is available for both the Starbucks and WorldBank datasets.

  2. Adjusted Score for Missing Data: the percentage of data that is not missing for mandatory features (i.e. features without which the record would have to be discarded), i.e. the count of non-empty cells in mandatory columns vs. the total number of such cells, expressed as a percentage. If all features are mandatory then the adjusted score and the raw score are the same. The higher this score, the cleaner the dataset from the perspective of missing data. This score is available for both the Starbucks and WorldBank datasets.

  3. Raw Score for Invalid Data: the percentage of data that is not invalid, i.e. the count of cells containing valid data vs. the total number of cells, expressed as a percentage. The higher this score, the cleaner the dataset from the perspective of invalid data. Validity checks are both generic and context specific as defined above. This score is available for both the Starbucks and WorldBank datasets.

  4. Adjusted Score for Invalid Data: the percentage of data that is not invalid for mandatory features (i.e. features without which the record would have to be discarded), i.e. the count of cells containing valid data in mandatory columns vs. the total number of such cells, expressed as a percentage. The higher this score, the cleaner the dataset from the perspective of invalid data. Validity checks are both generic and context specific as defined above. This score is available for both the Starbucks and WorldBank datasets.
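As a concrete illustration, a minimal sketch of how the two missing-data scores could be computed with pandas is shown below; the actual computation is done offline in the *_check_quality.py modules, and the mandatory column list here is only an assumption.

import pandas as pd

def missing_data_scores(df, mandatory_cols):
    """Raw score: % of non-empty cells over all cells.
    Adjusted score: the same ratio restricted to the mandatory columns."""
    raw = 100.0 * df.notnull().values.sum() / df.size
    mandatory = df[mandatory_cols]
    adjusted = 100.0 * mandatory.notnull().values.sum() / mandatory.size
    return raw, adjusted

# Illustrative usage on the Starbucks dataframe loaded earlier (the mandatory list is an assumption):
# raw, adjusted = missing_data_scores(df, ['store_id', 'country', 'latitude', 'longitude'])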

Validity checks implemented for Starbucks dataset

Field Validity Check
Store Id Has to be unique for each row
Latitude/Longitude Longitude measurements range from 0° to (+/–)180°, Latitude from -90° to +90°
Timezone offset Has to be divisible by 15
Country code Has to be present in a country code list downloaded offline
Brand Has to be within one of 6 brands listed on Starbucks website http://www.starbucks.com/careers/brands
Store Number Has to follow a format XXXX-YYYY
Timezone name Has to have either "Standard Time" or "UTC" as part of the name
Ownership Type Should not contain special characters
First seen Has to follow a date format
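A minimal sketch of a few of these checks is shown below, assuming the column names from the dataset preview above; the project's full implementation is in sb/sb_check_quality.py and common/utils.py.

import pandas as pd

def starbucks_validity_checks(df):
    """Sketch of a subset of the validity checks listed above; returns per-check failure counts."""
    failures = {}
    # Store id has to be unique for each row.
    failures['store_id_unique'] = int(df['store_id'].duplicated().sum())
    # Latitude within [-90, 90] and longitude within [-180, 180].
    failures['lat_long_range'] = int(((df['latitude'].abs() > 90) |
                                      (df['longitude'].abs() > 180)).sum())
    # Timezone offset has to be divisible by 15 (minutes).
    failures['tz_offset'] = int((df['current_timezone_offset'] % 15 != 0).sum())
    # Store number has to follow the XXXX-YYYY format (digits, dash, digits).
    failures['store_number_format'] = int((~df['store_number'].astype(str)
                                            .str.match(r'^\d+-\d+$')).sum())
    return failures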

Validity checks implemented for Worldbank dataset

Field Validity Check
Country Code Should not contain special characters
All 83 features (WDIs) Should be numeric as all of them represent a numerical measure of the feature

DQS for Starbucks and WorldBank datasets

The following code fragment downloads a CSV file containing these scores from the Github repo of this project. These scores have been calculated offline by running the program and uploading the results as part of the Github repo. The missing data raw score for the WorldBank data is quite low (~34%); this is because for several parameters the 2015 data is not yet available, so in this case we will use the 2014 data, as these are macro level indicators which do not see a drastic change year on year (this will be discussed more in the next phase of the project).


In [11]:
import pandas as pd
DQS_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/dqs.csv'

df = pd.read_csv(DQS_URL)
df.head()


Out[11]:
Datasource Invalid_Data_Raw_Score Invalid_Data_Adjusted_Score Missing_Data_Raw_Score Missing_Data_Adjusted_Score
0 WorldBank 100.000000 100.000000 33.648990 33.64899
1 Starbucks 99.999233 99.999233 91.523031 100.00000

Feature Creation

Three (for now) new features are created using the datasets. These features are all added to the Starbucks dataset. These are described below:

Feature Format Explanation
Continent String Map the country code to the continent to which the country belongs. This feature enables answering basic questions like what % of stores exist in Asia as compared to North America for example. The country code to continent mapping is obtained with the help of two csv files downloaded offline (code to country mapping, country to continent mapping), first country code is mapped to country name and then country name is mapped to the continent name.
On Airport Boolean Does this store exist on an airport? This feature helps answer questions like how many of the Starbucks stores in Europe exist on airports, for example. For now a store is considered to be on an airport if the name field or the street combined field for the store contains one of the following terms: "Airport, Arpt, Terminal". Going forward this logic will be replaced with an IATA API (see comments in code, wb.py file).
Ease of doing business Category String The ease of doing business index (obtained from the WorldBank data) for the country in which the store exists is mapped to a category viz. Very High (1 to 10), High (11 to 30), Medium (31 to 90), Low (91 to 130), Very Low (131 and beyond) and finally Unknown for countries where no data is available. This feature provides insights like: around 70% of Starbucks stores exist in countries which have an ease of doing business index between 1 and 10 (with 1 being the highest and 189 being the lowest). The ease of doing business index is obtained by looking up the country code in the WDI dataset.
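The actual feature creation code is in sb/sb.py; the snippet below is only a rough sketch of the On Airport and Continent features, assuming the column names shown in the dataset preview and assuming the two offline lookup files have already been loaded into dictionaries.

import pandas as pd

AIRPORT_TERMS = ('airport', 'arpt', 'terminal')

def add_features(sb_df, code_to_name, name_to_continent):
    """Sketch of two of the new features described in the table above."""
    # On Airport: True if the store name or combined street address contains an
    # airport-related term (case-insensitive match).
    text = (sb_df['name'].fillna('') + ' ' + sb_df['street_combined'].fillna('')).str.lower()
    sb_df['on_airport'] = text.apply(lambda s: any(term in s for term in AIRPORT_TERMS))

    # Continent: country code -> country name -> continent, using dictionaries assumed to be
    # built from the offline lookup files (data.csv and countries.csv).
    sb_df['continent'] = sb_df['country'].map(code_to_name).map(name_to_continent)
    return sb_df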

The following table shows a view of the modified dataset with new features added.


In [2]:
import pandas as pd
SB_W_NEW_FEATURES_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/SB_data_w_features.csv'

df = pd.read_csv(SB_W_NEW_FEATURES_URL)

#show only some of the columns so that new feature can be shown without a scrollbar
cols = ['store_id', 'country', 'name', 'city', 'brand', 'continent', 'on_airport', 'eodb_category']
df[cols].head()


Out[2]:
store_id country name city brand continent on_airport eodb_category
0 1 CN Plaza Hollywood Hong Kong Starbucks Asia False M
1 6 CN Exchange Square Hong Kong Starbucks Asia False M
2 8 CN Telford Plaza Kowloon Starbucks Asia False M
3 13 CN Hong Kong Station Hong Kong Starbucks Asia False M
4 17 CN Pacific Place, Central Hong Kong Starbucks Asia False M

Some notes about the code and how to run the SBS program

This section provides information about how to run the code, how the code is organized among different modules, and what output files are generated.

How to run the code

When you unzip the zip file from Github, you will see a folder called sb_study which contains all the code, helper files and even an output folder (the contents of the output folder get rewritten on every run of the SBS program). As a pre-requisite, the machine on which this program is run needs to have Python 3.5.2|Anaconda 4.1.1 installed. Run the program by simply saying

python main.py

The code takes about 2 to 3 minutes to run completely. As the program runs it prints traces on the console, and these traces are also automatically logged to a file called SBS.log in the output directory (created by the SBS program in the current working directory if not already present). The program creates the output directory and several sub-directories inside it to store the various artifacts (csv files, plots etc.) produced while running the program.

The program can also be run in analysis-only mode, which assumes that the data has already been downloaded by running the program in the full mode first. In analysis-only mode the program runs various forms of analysis on the data and stores the created artifacts under subdirectories in the output folder. This mode is useful if only the analysis needs to be redone and the entire dataset does not need to be downloaded again. To run the program in analysis-only mode use the following command line.

python main.py -a

Code organization and other files

Here is a listing of everything that is included along with a brief description.

File name Notes
main.py Everything starts from here, as the name suggests this is the main file of the package. This is what we run. It is organized to run a series of functions like a data science pipeline, for example get_data, clean_data and so on. It internally invokes the wb (for world bank) module and the sb (for Starbucks) module to do the same tasks for the Worldbank and Starbucks datasets respectively.
README.md This is the readme markdown file for Github.
LICENCE.md This is the license markdown file for Github. The SBS project is available under MIT license.
countries.csv A CSV file which provides the country to continent mapping.
data.csv A CSV file which provides the country code to country name mapping. Downloaded offline.
WDI_Series.csv A CSV file containing the name and description of World Development Indicators (WDI) selected for use in this project. Created offline.
wb Directory for the world bank submodule
wb\wb.py Main file for the WorldBank submodule. It contains functions to run the data science pipeline for the WorldBank dataset.
wb\wb_check_quality.py This file performs the data quality checks i.e. missing data and invalid data checks for the WorldBank data. It internally invokes the utils.py module for some of the data validation checks.
sb\sb.py Main file for the Starbucks submodule. It contains functions to run the data science pipeline for the Starbucks dataset. It also contains code for feature generation.
sb\sb_check_quality.py This file performs the data quality checks i.e. missing data and invalid data checks for the Starbucks data. It internally invokes the utils.py module for some of the data validation checks. Many of the data validation checks done are specific to the Starbucks data set and are included in this file.
common Directory for the common submodule.
common\utils.py This file contains utility functions that are common to both modules, so this is the one place to implement these functions. Several data validation checks are also included in this file.
common\logger.py This file creates a logger object using the Python logging module. It sets the format specifier so that the trace messages have the desired format (timestamp, module name, etc.) and also takes care of attaching two handlers, one for the console and one for the log file so all traces go to both the console and the log file.
common\globals.py This file contains the common constants and even some variables that are common across modules. Think of this as a common header file.
output This is the folder which contains all the output (including log file) produced by the SBS program. The SBS program creates this directory if not present.
output\SBS.log This is the log file created upon running the program. This file also contains more detailed DQS metrics i.e. missing data and invalid data related errors for individual features.
output\SB_data_as_downloaded.csv This is the CSV version of the Starbucks data downloaded via the Socrata API.
output\SB_data_w_features.csv This is the CSV version of the Starbucks data with the new features added to it (last three columns) as part of the feature creation activity.
output\WB_data_as_downloaded.csv This is the CSV version of the WorldBank data downloaded via the WB API.
output\dqs.csv This is the CSV file which contains the data quality score for both the datasets.

A note about the structure of the output folder.


The output folder contains several csv files which correspond to data as downloaded, cleaned data and combined data.

  • WDI_SB.csv: this is the combined dataset; all analysis is run on this dataset.
  • DQS: this folder contains the various DQS files generated at different stages; the contents are self-explanatory. The Jupyter notebook displays the contents of these files.
  • EDA: this folder contains artifacts from the exploratory data analysis, which include csv files and plots; the contents are self-explanatory. The Jupyter notebook displays the contents of these files.
  • scatter: this folder contains various scatter plots generated for the combined dataset. Specifically, the plots between the number of Starbucks stores and the WDI indicators.
  • regression: this folder contains various artifacts (csv files and plots) generated as part of regression.
  • clustering: this folder contains various artifacts (csv files and plots) generated as part of clustering.
  • classification: this folder is a placeholder for artifacts created as part of classification. Currently the classification results can be seen in the Jupyter notebook and the SBS.log file.
  • association_rules: this folder contains various artifacts (csv files only) generated as part of association rule mining. These csv files are included as part of the Jupyter notebook.

Combining the two datasets

The WorldBank WDI indicators dataset and the Starbucks dataset are combined for the hypothesis testing and machine learning described later in this notebook. To combine the two datasets we first filter out all the countries from the WorldBank dataset which do not have a Starbucks store, i.e. we keep only those countries in the combined dataset that have at least one Starbucks store. This combined dataset is then cleaned (described later in this document) and all analysis is run on it. There are 72 countries in the world where Starbucks is present, so the combined dataset contains 72 rows and 48 columns (more on the column count later). The following fields computed from the Starbucks dataset are added to the combined dataset:

Feature Description
Num.Starbucks.Stores Total number of Starbucks stores in the country
Num.Starbucks.Stores.Categorical Total number of Starbucks stores in the country categorized as VL (Very Low), L (Low), M (Medium), H (High), VH (Very High)
Starbucks.Store.Density Number of Starbucks stores per 100,000 people
Starbucks.Store.Density.Categorical Number of Starbucks stores per 100,000 people categorized as VL (Very Low), L (Low), M (Medium), H (High), VH (Very High)
Exists.inMultiple.Cities True if Starbucks stores exist in multiple cities in a country, False otherwise
Ownership.Type.Mixed True if multiple ownership types exist for Starbucks stores in a country (franchisee owned, Starbucks owned etc.), False otherwise
Continent Name of the continent in which the country exists
MultipleBrands True if Starbucks exists as multiple brands in the country, False otherwise

The shape of the combined dataset is (72,48). This includes 35 WDI (World Development Indicators) from the WorldBank dataset.
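A minimal sketch of how this filtering and joining could be expressed in pandas is shown below; the column names follow the dataset previews above, but the project's actual combining logic may differ in details.

import pandas as pd

def build_combined_dataset(sb_df, wb_df):
    """Sketch: keep only WorldBank rows for countries that have at least one Starbucks
    store, and attach the store counts computed from the Starbucks data."""
    # Number of Starbucks stores per country (ISO-2 code in the 'country' column).
    store_counts = sb_df.groupby('country').size().rename('Num.Starbucks.Stores')

    # Keep only countries present in the Starbucks data, then join the store counts.
    combined = wb_df[wb_df['country_code'].isin(store_counts.index)].copy()
    return combined.merge(store_counts, left_on='country_code', right_index=True)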

For converting continuous columns to categorical, a quantile (percentile) based binning strategy is used, as shown in the sketch after the following table.

Range Category
<= 10th percentile Very Low (VL)
<=20th percentile Low (L)
<=60th percentile Medium (M)
<= 80th percentile High (H)
beyond 80th percentile Very High (VH)
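A minimal sketch of this percentile-based binning, using pandas.qcut with the quantile edges from the table above, is shown below (real data with many tied values may need the bin edges adjusted).

import pandas as pd

def to_percentile_category(series):
    """Sketch of the percentile-based binning described in the table above."""
    # Bin edges at the 10th, 20th, 60th and 80th percentiles of the feature.
    return pd.qcut(series, q=[0, 0.10, 0.20, 0.60, 0.80, 1.0],
                   labels=['VL', 'L', 'M', 'H', 'VH'])

# Illustrative usage on the combined dataset:
# combined['Num.Starbucks.Stores.Categorical'] = to_percentile_category(combined['Num.Starbucks.Stores'])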

Exploratory Data Analysis

This section describes various types of exploratory analysis run on the two datasets as well as on the combined dataset before getting into the hypothesis building and machine learning part.

Basic Statistical Analysis and data cleaning

This section provides basic statistical analysis for both the WorldBank and Starbucks datasets. The following boxplot provides a quick overview of the distribution of Starbucks stores across continents. Some quick observations from the box plot are described below:

  1. There is a huge variation in the distribution of Starbucks stores in the continent of North America. This is expected since, other than the U.S. and Canada, all other countries are islands. Also, the North American continent contains the country with the most Starbucks stores (the U.S. of course, no surprises there).

  2. There is minimal variation in the distribution of Starbucks stores in the continent of Oceania since it consists of only two countries which have Starbucks stores, i.e. Australia and New Zealand.

  3. The European continent has the most even distribution around the median.

Mean, mode, median, standard deviation

The WorldBank dataset consists of all numeric values except for the country code, country name and the derived feature for ease of doing business (categorical). The country code and name are excluded from the statistical analysis since there is only one row per country in the dataset. The statistical measures for each feature are listed below. Some of the features show NaN for the statistical measures; that is because there is no data at all for these features (they are subsequently removed from the dataset, as described later).


In [19]:
import pandas as pd
WB_EDA_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/EDA/WB_EDA.csv'
df = pd.read_csv(WB_EDA_URL)
df


Out[19]:
feature mean median mode stddev
0 IC.TAX.METG 1.467218e+00 1.435000e+00 0 5.794943e-01
1 SL.EMP.TOTL.SP.FE.NE.ZS 4.530958e+01 4.750000e+01 0 1.220095e+01
2 IT.NET.USER.P2 4.755657e+01 4.688564e+01 0 2.769172e+01
3 SP.POP.TOTL 2.970610e+08 1.009667e+07 0 9.376648e+08
4 VC.PKP.TOTL.UN 4.955647e+03 7.930000e+02 0 6.086792e+03
5 IC.GOV.DURS.ZS 8.830681e+00 8.956250e+00 0 4.266066e+00
6 IC.REG.PROC 6.938789e+00 7.000000e+00 0 2.928893e+00
7 IC.LGL.CRED.XQ 5.110327e+00 5.000000e+00 0 2.741384e+00
8 EG.ELC.ACCS.UR.ZS NaN NaN 0 NaN
9 IC.BUS.NDNS.ZS 3.674968e+00 2.290769e+00 0 4.270679e+00
10 EN.URB.LCTY.UR.ZS 2.988715e+01 2.684948e+01 0 1.682697e+01
11 SL.EMP.TOTL.SP.MA.NE.ZS 6.361581e+01 6.384569e+01 0 8.498745e+00
12 IC.ELC.DURS 3.351995e+01 3.074444e+01 0 2.697875e+01
13 SP.URB.TOTL.IN.ZS 5.862963e+01 5.837484e+01 0 2.313314e+01
14 NY.GNP.PCAP.CD 1.267481e+04 5.585000e+03 0 1.769632e+04
15 SL.TLF.ACTI.1524.NE.ZS 4.255292e+01 4.113231e+01 0 1.256837e+01
16 IC.WRH.DURS 1.611548e+02 1.540000e+02 0 8.145088e+01
17 IC.TAX.TOTL.CP.ZS 4.088493e+01 3.927222e+01 0 1.914462e+01
18 SI.POV.GINI 4.012033e+01 4.172000e+01 0 9.002209e+00
19 SI.POV.2DAY 1.790452e+01 1.129000e+01 0 2.032023e+01
20 IC.ELC.TIME 9.773877e+01 8.700000e+01 0 6.537607e+01
21 IC.TAX.OTHR.CP.ZS 8.843270e+00 3.000000e+00 0 1.803673e+01
22 IC.ISV.DURS 2.597505e+00 2.581818e+00 0 1.055514e+00
23 FB.ATM.TOTL.P5 4.842994e+01 4.170072e+01 0 4.303515e+01
24 SL.UEM.NEET.ZS 1.347953e+01 1.228000e+01 0 6.136543e+00
25 EG.ELC.RNEW.ZS NaN NaN 0 NaN
26 IQ.WEF.PORT.XQ 3.958079e+00 3.904944e+00 0 1.123580e+00
27 IC.TAX.GIFT.ZS 1.529794e+01 1.701000e+01 0 8.678293e+00
28 IQ.WEF.CUST.XQ 3.983466e+00 3.813284e+00 0 8.156708e-01
29 EN.ATM.PM25.MC.ZS NaN NaN 0 NaN
... ... ... ... ... ...
53 BX.KLT.DINV.WD.GD.ZS 4.660374e+00 2.556654e+00 0 8.348388e+00
54 TX.VAL.TECH.CD 7.403359e+10 2.120983e+08 0 2.555456e+11
55 EN.CLC.MDAT.ZS NaN NaN 0 NaN
56 SL.EMP.WORK.ZS 8.124434e+01 8.417780e+01 0 9.151259e+00
57 SH.STA.ACSN.UR 7.898257e+01 8.797773e+01 0 2.325356e+01
58 WP_time_01.1 5.330584e+01 5.126150e+01 0 3.125552e+01
59 SL.IND.EMPL.ZS 2.338136e+01 2.320000e+01 0 6.635862e+00
60 EN.POP.SLUM.UR.ZS 4.404918e+01 4.300000e+01 0 2.234641e+01
61 SP.POP.1564.TO.ZS 6.374940e+01 6.536425e+01 0 6.471699e+00
62 SE.ADT.LITR.ZS 8.416093e+01 9.397165e+01 0 1.876130e+01
63 FI.RES.TOTL.DT.ZS 9.210792e+01 4.777428e+01 0 2.829776e+02
64 IC.BUS.DISC.XQ 5.470902e+00 5.487101e+00 0 2.288636e+00
65 CM.MKT.TRAD.GD.ZS 6.147025e+01 1.907471e+01 0 1.006608e+02
66 SL.SRV.EMPL.ZS 6.585253e+01 6.905000e+01 0 1.252427e+01
67 NE.CON.PRVT.KD.ZG 2.794388e+00 3.038269e+00 0 4.618765e+00
68 AG.LND.TOTL.UR.K2 NaN NaN 0 NaN
69 FP.CPI.TOTL.ZG 3.764240e+00 1.835994e+00 0 9.904829e+00
70 ST.INT.RCPT.XP.ZS 1.426136e+01 6.996199e+00 0 1.748747e+01
71 IC.REG.COST.PC.ZS 2.671969e+01 1.397500e+01 0 3.990623e+01
72 LP.LPI.OVRL.XQ 2.867913e+00 2.739469e+00 0 5.229044e-01
73 IC.FRM.BRIB.ZS 2.187780e+01 2.390000e+01 0 1.124342e+01
74 IC.FRM.DURS 2.277239e+01 1.909048e+01 0 1.139392e+01
75 IT.CEL.SETS.P2 1.063206e+02 1.078952e+02 0 3.922009e+01
76 IC.IMP.COST.CD 1.883703e+03 1.480000e+03 0 1.511363e+03
77 IC.EXP.COST.CD 1.564382e+03 1.270810e+03 0 1.127318e+03
78 IC.TAX.LABR.CP.ZS 1.572635e+01 1.405870e+01 0 9.984998e+00
79 IC.ELC.OUTG 7.778588e+00 5.650000e+00 0 8.081534e+00
80 EN.ATM.CO2E.EG.ZS NaN NaN 0 NaN
81 DT.DOD.PVLX.GN.ZS 2.090316e+01 1.743921e+01 0 1.352618e+01
82 Ease.Of.Doing.Business 0.000000e+00 0.000000e+00 VeryLow:70 0.000000e+00

83 rows × 5 columns

The Starbucks dataset consists of all categorical features, therefore only a mode calculation is valid. Several features such as street names are excluded from the mode calculation as well. There are some interesting observations to be made:

  1. The city with the maximum number of Starbucks stores is a city outside of the U.S.
  2. The maximum number of Starbucks stores is in the U.S. and in the Eastern Time zone.
  3. Most Starbucks stores (about 70%) are in countries with a very high ease of doing business.

In [20]:
import pandas as pd
SB_EDA_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/EDA/SB_EDA.csv'
df = pd.read_csv(SB_EDA_URL)
df


Out[20]:
feature mean median mode stddev
0 brand 0 0 Starbucks:24733 0
1 city 0 0 上海市:498 0
2 country 0 0 US:13486 0
3 ownership_type 0 0 CO:11777 0
4 timezone 0 0 Eastern Standard Time:5460 0
5 continent 0 0 North America:15568 0
6 on_airport 0 0 False:24641 0
7 eodb_category 0 0 VH:15549 0

Outlier Detection

Outlier detection is performed on the WorldBank dataset. We determine outliers using a simple rule: anything beyond three standard deviations is an outlier. The results of outlier detection are provided below. Note that while there are values which get categorized as outliers, they are still valid values (as understood by human inspection of the data), therefore no outlier handling as such (smoothing, removal etc.) is done for these values. Also, given that these are valid values, they cannot be ignored as they may have a significant role in the machine learning algorithms.

No outlier detection is performed on the Starbucks dataset as it contains only categorical values (values outside the well-defined set of valid values are considered invalid, and that analysis has already been performed as part of the data quality detection phase).
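A minimal sketch of the three-standard-deviation rule, producing a table similar to the one below, is shown here; the project's exact implementation (for example whether the deviation is measured from the column mean) may differ slightly.

import pandas as pd

def find_outliers(df, name_col='name'):
    """Sketch: flag numeric values more than three standard deviations away from the column mean."""
    records = []
    for col in df.select_dtypes(include='number').columns:
        threshold = 3 * df[col].std()
        outliers = df[(df[col] - df[col].mean()).abs() > threshold]
        for _, row in outliers.iterrows():
            records.append({'entry': row[name_col], 'field': col,
                            'value': row[col], '3rdStdDev': threshold})
    return pd.DataFrame(records, columns=['entry', 'field', 'value', '3rdStdDev'])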


In [21]:
import pandas as pd
WB_OUTLIER_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/EDA/WB_outliers.csv'
df = pd.read_csv(WB_OUTLIER_URL)
df


Out[21]:
dataset entry field value 3rdStdDev
0 WB Early-demographic dividend SP.POP.TOTL 3.122703e+09 2.812994e+09
1 WB Middle income SP.POP.TOTL 5.521157e+09 2.812994e+09
2 WB IBRD only SP.POP.TOTL 4.542581e+09 2.812994e+09
3 WB Low & middle income SP.POP.TOTL 6.159443e+09 2.812994e+09
4 WB IDA & IBRD total SP.POP.TOTL 6.183635e+09 2.812994e+09
5 WB World SP.POP.TOTL 7.346633e+09 2.812994e+09
6 WB Bhutan IC.GOV.DURS.ZS 2.880000e+01 1.279820e+01
7 WB Equatorial Guinea IC.REG.PROC 1.800000e+01 8.786678e+00
8 WB Venezuela, RB IC.REG.PROC 1.700000e+01 8.786678e+00
9 WB Philippines IC.REG.PROC 1.600000e+01 8.786678e+00
10 WB New Zealand IC.BUS.NDNS.ZS 1.663000e+01 1.281204e+01
11 WB Malta IC.BUS.NDNS.ZS 1.726000e+01 1.281204e+01
12 WB Hong Kong SAR, China IC.BUS.NDNS.ZS 3.130000e+01 1.281204e+01
13 WB Hong Kong SAR, China EN.URB.LCTY.UR.ZS 1.000000e+02 5.048091e+01
14 WB Macao SAR, China EN.URB.LCTY.UR.ZS 9.945780e+01 5.048091e+01
15 WB Singapore EN.URB.LCTY.UR.ZS 1.000000e+02 5.048091e+01
16 WB Ethiopia IC.ELC.DURS 1.943000e+02 8.093626e+01
17 WB Luxembourg NY.GNP.PCAP.CD 7.700000e+04 5.308897e+04
18 WB Switzerland NY.GNP.PCAP.CD 8.418000e+04 5.308897e+04
19 WB Qatar NY.GNP.PCAP.CD 8.543000e+04 5.308897e+04
20 WB Norway NY.GNP.PCAP.CD 9.382000e+04 5.308897e+04
21 WB Macao SAR, China NY.GNP.PCAP.CD 7.630000e+04 5.308897e+04
22 WB Barbados IC.WRH.DURS 4.420000e+02 2.443526e+02
23 WB Cyprus IC.WRH.DURS 6.170000e+02 2.443526e+02
24 WB Cambodia IC.WRH.DURS 6.520000e+02 2.443526e+02
25 WB Brazil IC.WRH.DURS 4.257000e+02 2.443526e+02
26 WB Zimbabwe IC.WRH.DURS 4.480000e+02 2.443526e+02
27 WB Argentina IC.TAX.TOTL.CP.ZS 1.374000e+02 5.743387e+01
28 WB Comoros IC.TAX.TOTL.CP.ZS 2.165000e+02 5.743387e+01
29 WB Bangladesh IC.ELC.TIME 4.289000e+02 1.961282e+02
... ... ... ... ... ...
118 WB Macao SAR, China ST.INT.RCPT.XP.ZS 9.367504e+01 5.246241e+01
119 WB South Sudan IC.REG.COST.PC.ZS 3.301000e+02 1.197187e+02
120 WB Central African Republic IC.REG.COST.PC.ZS 2.040000e+02 1.197187e+02
121 WB Haiti IC.REG.COST.PC.ZS 2.353000e+02 1.197187e+02
122 WB Chad IC.REG.COST.PC.ZS 1.504000e+02 1.197187e+02
123 WB Djibouti IC.REG.COST.PC.ZS 1.681000e+02 1.197187e+02
124 WB Mauritania IC.FRM.DURS 6.460000e+01 3.418175e+01
125 WB Kuwait IT.CEL.SETS.P2 2.317632e+02 1.176603e+02
126 WB Hong Kong SAR, China IT.CEL.SETS.P2 2.288316e+02 1.176603e+02
127 WB Macao SAR, China IT.CEL.SETS.P2 3.244408e+02 1.176603e+02
128 WB Zambia IC.IMP.COST.CD 7.060000e+03 4.534090e+03
129 WB South Sudan IC.IMP.COST.CD 9.285000e+03 4.534090e+03
130 WB Congo, Rep. IC.IMP.COST.CD 7.590000e+03 4.534090e+03
131 WB Tajikistan IC.IMP.COST.CD 1.065000e+04 4.534090e+03
132 WB Uzbekistan IC.IMP.COST.CD 6.452000e+03 4.534090e+03
133 WB Chad IC.IMP.COST.CD 9.025000e+03 4.534090e+03
134 WB Zambia IC.EXP.COST.CD 5.165000e+03 3.381954e+03
135 WB South Sudan IC.EXP.COST.CD 5.335000e+03 3.381954e+03
136 WB Tajikistan IC.EXP.COST.CD 9.050000e+03 3.381954e+03
137 WB Afghanistan IC.EXP.COST.CD 5.045000e+03 3.381954e+03
138 WB Central African Republic IC.EXP.COST.CD 5.490000e+03 3.381954e+03
139 WB Uzbekistan IC.EXP.COST.CD 5.090000e+03 3.381954e+03
140 WB Kazakhstan IC.EXP.COST.CD 5.285000e+03 3.381954e+03
141 WB Chad IC.EXP.COST.CD 6.615000e+03 3.381954e+03
142 WB Belgium IC.TAX.LABR.CP.ZS 4.940000e+01 2.995499e+01
143 WB France IC.TAX.LABR.CP.ZS 5.350000e+01 2.995499e+01
144 WB Nigeria IC.ELC.OUTG 3.280000e+01 2.424460e+01
145 WB Papua New Guinea IC.ELC.OUTG 4.190000e+01 2.424460e+01
146 WB Bhutan DT.DOD.PVLX.GN.ZS 6.815542e+01 4.057852e+01
147 WB Cabo Verde DT.DOD.PVLX.GN.ZS 6.770737e+01 4.057852e+01

148 rows × 5 columns

Bin the data

As part of the analysis that follows, several continuous/discrete features were binned (as described below, this is needed for association rule mining, feature selection etc.). Here we show the binning of one specific feature, IC.BUS.EASE.XQ, which quantifies the ease of doing business in a country (1 being the best; the larger the value, the lower the ease of doing business), into a categorical value. Numeric values get mapped to categories as follows:

Range Category
[1,11) VeryHigh
[11,31) High
[31,91) Medium
[91, 131) Low
[131, beyond) VeryLow

The benefit of binning this feature is that it supports exploratory data analysis to answer questions like what % of Starbucks stores are in countries with high or very high ease of doing business. The method used provides a simple strategy to bin the data, although it would require a subject matter expert to say what the most appropriate bin edges are.
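A minimal sketch of this fixed-edge binning using pandas.cut is shown below; the edges and labels come from the table above.

import numpy as np
import pandas as pd

def bin_ease_of_doing_business(df):
    """Sketch: map the IC.BUS.EASE.XQ rank to the categories in the table above."""
    df['Ease.Of.Doing.Business'] = pd.cut(
        df['IC.BUS.EASE.XQ'],
        bins=[1, 11, 31, 91, 131, np.inf],  # [1,11), [11,31), [31,91), [91,131), [131, beyond)
        right=False,
        labels=['VeryHigh', 'High', 'Medium', 'Low', 'VeryLow'])
    return df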

Here is an excerpt from the WorldBank data showing the new column created as a result of binning along with the original feature (first 5 rows shown).


In [22]:
import pandas as pd
WB_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/WB_data_w_features.csv'
df = pd.read_csv(WB_URL)
df[['name', 'IC.BUS.EASE.XQ', 'Ease.Of.Doing.Business']].head()


Out[22]:
name IC.BUS.EASE.XQ Ease.Of.Doing.Business
0 Belgium 43.000000 Medium
1 Equatorial Guinea 180.000000 VeryLow
2 Early-demographic dividend 117.868852 Low
3 Small states 107.354167 Low
4 Bangladesh 174.000000 VeryLow

Handling missing values

There are not a lot of missing values in the Starbucks dataset (the adjusted data quality score for missing values is 100%, which means all the features that we care about have no missing values; see the section on DQS above). The WorldBank dataset, however, has a lot of missing values; its DQS for missing values is ~33.65%. The missing values in the WorldBank dataset are handled as follows:

  1. The WorldBank data is updated every year and many times there is a time lag in the data becoming available for a year. Also, since these indicators are macro level indicators, the year on year change is not drastic. As a first step, all the values that are missing for the latest year are filled with the values from the previous year whenever the previous year's data is available.

  2. Define a feature density threshold of 90% and delete all features which are less than 90% filled.

  3. Finally, the analysis is performed on the combined dataset; as described above, the combined dataset is obtained by joining the WorldBank and Starbucks datasets only for those countries that exist in the Starbucks dataset. This intersection results in a dataset which is fairly complete (99.8%).

    • Any missing values are handled by replacing them with the mean value of the feature.
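A minimal sketch of these three steps is shown below, assuming the yearly WDI data is held in dataframes indexed by country code; the project's actual cleaning code may differ in details.

import pandas as pd

def handle_missing_values(wdi_2015, wdi_2014, density_threshold=0.9):
    """Sketch of the three-step missing value strategy described above."""
    # Step 1: fill 2015 gaps with the previous year's value wherever it is available.
    filled = wdi_2015.fillna(wdi_2014)

    # Step 2: drop features (columns) that are less than 90% filled.
    density = filled.notnull().mean()
    filled = filled.loc[:, density >= density_threshold]

    # Step 3: after combining with the Starbucks data, replace any remaining
    # missing values with the mean of the feature.
    return filled.fillna(filled.mean(numeric_only=True))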

The DQS metrics for the original datasets as well as the combined dataset after cleaning are shown below.


In [24]:
import pandas as pd
DQS_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/dqs/dqs_after_cleaning.csv'
df = pd.read_csv(DQS_URL)
#This is the DQS after Step 1 described above
df


Out[24]:
Datasource Combined_Score Invalid_Data_Raw_Score Invalid_Data_Adjusted_Score Missing_Data_Raw_Score Missing_Data_Adjusted_Score
0 WorldBank 76.92776 100.00000 100.00000 53.855519 53.855519
1 Starbucks 99.99981 99.99962 99.99962 91.529921 100.000000

In [25]:
import pandas as pd
DQS_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/dqs/dqs_combined_dataset.csv'
df = pd.read_csv(DQS_URL)
#This is the DQS after Step 3 described above (Combined dataset, actually used for analysis)
df


Out[25]:
dataset before_cleaing after_cleaning
0 WDI+SB 61.20915 99.812312

Data cleaning for the combined dataset

The combined dataset is cleaned for missing values as described in the previous section.

  1. Missing values are handled as described above (take missing data from 2014, remove features for which the density is less than 90%; many rows get removed when we combine the WorldBank data with the Starbucks data).
  2. There are no invalid values in the WorldBank dataset (see Data Issues and Data Quality Score sections above). The invalid values from the Starbucks dataset (some lat/long values were missing, names of some streets could not be read due to errors with unicode handling) do not impact the overall analysis and are therefore ignored, see adjusted DQS for invalid data.

Histograms and Correlations

This section provides histograms and correlations from the individual datasets where appropriate and also from the combined dataset used for analysis.

Histograms from WorldBank dataset

The following figure shows histograms of three different features from the WorldBank dataset.

The observations from these histograms are described below.

Feature Description Observations
IC.BUS.EASE.XQ Ease of doing business index (1=most business-friendly regulations) Almost uniformly distributed except for the 100 to 150 range (indicating a higher concentration of countries in that range).
SP.URB.GROW Urban population growth (annual %) Almost normally distributed.
WP15163_4.1 Mobile account (% age 15+) Almost looks like an EL (exponential-logarithmic) distribution.

Histograms from the combined dataset

The following figure shows histograms of four different features from the combined dataset. The value of the correlation coefficient is also shown.

The observations from these histograms are described below.

Feature Description Observations
IT.NET.USER.P2 Internet users (per 100 people) Looks like a left-skewed distribution. Most countries with Starbucks stores have a very high percentage of Internet users, which makes sense if one visualizes a typical coffee shop.
Num.Starbucks.Stores Number of Starbucks stores Most countries (in fact 69 out of the 73 which have Starbucks) have less than 1,000 stores, which makes sense but does not tell us much. A better representation would be to increase the number of bins in this histogram.
SP.POP.TOTL Total population The outliers in this chart towards the extreme right probably represent India and China.
ST.INT.ARVL International tourism, number of arrivals Interestingly, it appears that most Starbucks stores are not located in countries with very high tourist footfalls.

Scatter plots and correlation coefficients for the WorldBank dataset

The following figure shows the scatter plot for three features in the WorldBank dataset.

From the scatter plots we can observe that there is a negative correlation between the ease of doing business index and the number of Internet users in a country; but since the ease of doing business index is expressed such that a lower number means greater ease, it follows that ease of doing business and number of Internet users are effectively positively correlated.

Ease of doing business is also positively correlated with per capita GDP. There also seems to be a positive correlation between the number of Internet users and per capita GDP.

It is important to mention here the oft-repeated phrase: correlation does not imply causation.

The following table shows the correlation coefficients between these features, and the value of r corroborates the observations from the scatter plot above.


In [26]:
import pandas as pd
WB_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/EDA/WDI_pearson_r.csv'
df = pd.read_csv(WB_URL)
df


Out[26]:
Dataset feature1 feature2 r
0 WB IC.BUS.EASE.XQ SL.GDP.PCAP.EM.KD -0.557920
1 WB IC.BUS.EASE.XQ IT.NET.USER.P2 -0.831488
2 WB SL.GDP.PCAP.EM.KD IT.NET.USER.P2 0.741431

Scatter plots and correlation coefficients for the combined dataset

The following figure shows the scatter plots for features in the combined dataset.

From the scatter plots we can observe Number of Starbucks stores has somewhat of a positive non-linear relationship with the number of international tourist arrivals, total population and number of people having access to the Internet. This is further clarified when we do polynomial regression between number of Starbucks stores and number of international tourist arrivals.

The following table shows the correlation coefficients between these features, and the value of r corroborates the observations from the scatter plot above.


In [27]:
import pandas as pd
WB_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/EDA/Combined_r.csv'
df = pd.read_csv(WB_URL)
df


Out[27]:
Dataset feature1 feature2 r
0 combined Num.Starbucks.Stores IT.NET.USER.P2 0.028446
1 combined Num.Starbucks.Stores ST.INT.ARVL 0.521229
2 combined Num.Starbucks.Stores SP.POP.TOTL 0.263139
3 combined IT.NET.USER.P2 ST.INT.ARVL 0.087930
4 combined IT.NET.USER.P2 SP.POP.TOTL -0.324875
5 combined ST.INT.ARVL SP.POP.TOTL 0.311971

Cluster Analysis

Cluster analysis was conducted to find patterns within the data (from the combined dataset). Recall that the combined dataset is created by filtering out countries which do not have a Starbucks presence, which brings a Starbucks specific dimension to this dataset. The presence of Starbucks stores (say in terms of the number of stores in a country, or any other feature derived from the Starbucks dataset) is the dependent variable in this analysis, therefore any patterns we find in the clusters that are created are examined in that light, i.e. do the cluster members show an affinity to a particular value of the dependent variable.

Before clustering, a Principal Component Analysis (PCA) is first conducted to transform the data into 3 dimensions. PCA in 3 dimensions helps because the clusters can then be visualized in 3D space. Three different types of clustering mechanisms are used viz. KMeans, DBSCAN and Hierarchical (Agglomerative) clustering. This is described in detail in the next few sections; let's examine the PCA results first.

The following figure shows the result of PCA in 3 dimensions. Note that PCA only helps to spatially distinguish the data, but seeing the clusters without a spectral (color) differentiation is difficult (which is what the clustering algorithms provide).

KMeans, DBSCAN and Hierarchical Clustering

The following figures show the results of the 3 clustering techniques. It appears that KMeans (k=5) provides the best clustering.
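A minimal sketch of this PCA-then-cluster pipeline with scikit-learn is shown below; the DBSCAN parameters and the standardization step are assumptions, not necessarily the project's exact settings.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering

def cluster_combined(df_numeric):
    """Sketch: project the numeric features to 3 principal components, then cluster three ways."""
    X = StandardScaler().fit_transform(df_numeric.fillna(df_numeric.mean()))
    X3 = PCA(n_components=3).fit_transform(X)  # 3 components so the clusters can be plotted in 3D

    labels = pd.DataFrame({
        'KMeans_labels': KMeans(n_clusters=5, random_state=0).fit_predict(X3),
        'DBSCAN_labels': DBSCAN(eps=0.5, min_samples=3).fit_predict(X3),
        'Hierarchical_labels': AgglomerativeClustering(n_clusters=5).fit_predict(X3),
    }, index=df_numeric.index)
    return X3, labels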

Observations from clustering

  1. The plots above indicate that KMeans has the best clusters, although it has to be said that the clusters are not very clear and spaced out even with KMeans, which indicates that there may not be very clear patterns in the data.
  2. A value of k=5 was chosen as it matches the bins that we want to create for classifying the number of Starbucks stores as Very Low, Low, Medium, High and Very High.
  3. Vectors at the center of the clusters (KMeans) are centroids and the points in a cluster are all placed around them.
  4. The hierarchical and DBSCAN clustering produce similar results, as can be observed just by visual inspection of the plots; the KMeans clusters seem to be much better formed.

Specific observation from KMeans

The following table shows a filtered list of labels from each of the clustering algorithms attached to each entry in the dataset. The KMeans labels do seem to show a pattern in the countries that are grouped together.

  1. KMeans label=4 groups together the U.S. and China in a cluster of their own, which should not be a surprise. Of course, we can see any number of similarities that could have caused this, for example GDP, tourist arrivals etc. Also, from a Starbucks perspective, these countries represent a very high number of Starbucks stores (the top 2).

  2. KMeans label=3 groups together Brazil, Russia and other countries of the erstwhile Russian federation. From a Starbucks perspective these countries represent very similar Starbucks store density (number of stores per 100,000 people).

  3. KMeans label=2 groups together mostly Asian, African and some South American countries which have low to medium ease of business but medium to high population.

Other labels can also be explained in a similar way.


In [28]:
import pandas as pd
Combined_W_Labels_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/clustering/WDI_SB_w_labels.csv'
df = pd.read_csv(Combined_W_Labels_URL)
df[['name', 'KMeans_labels', 'SL.GDP.PCAP.EM.KD','Ease.Of.Doing.Business','ST.INT.ARVL.Categorical','SP.POP.TOTL.Categorical','Num.Starbucks.Stores','Starbucks.Store.Density']].head(10)


Out[28]:
name KMeans_labels SL.GDP.PCAP.EM.KD Ease.Of.Doing.Business ST.INT.ARVL.Categorical SP.POP.TOTL.Categorical Num.Starbucks.Stores Starbucks.Store.Density
0 Belgium 0 98644.171875 Medium M M 19 0.168354
1 Trinidad and Tobago 2 61845.410156 Medium VL L 1 0.073525
2 Oman 3 85342.617188 Medium L L 12 0.267228
3 Aruba 3 65623.968183 NaN VL VL 3 2.887697
4 Luxembourg 0 201747.796875 Medium VL VL 2 0.351077
5 Lebanon 1 42497.589844 Low L M 29 0.495664
6 El Salvador 1 18404.509766 Medium VL M 10 0.163223
7 Romania 3 37818.410156 Medium M H 25 0.126056
8 Costa Rica 3 30870.919922 Medium M M 11 0.228792
9 Argentina 1 31734.800781 Low M H 105 0.241842

Association Rules / Frequent Itemset Mining Analysis

Association rule mining is run on the combined dataset to find tuples of feature values that occur together in the dataset and to draw inferences from that. Most of the data in the combined dataset is continuous or discrete; to run association rule mining, the first step is to convert the data into categorical values. The combined dataset contains 40-odd features, so instead of converting the entire dataset to categorical, some selected features are converted and association rule mining is run on those features.

Continuous/discrete features are converted into categorical ones by binning them by percentile values, as described in previous sections. The dataset for this particular problem does not lend itself very well to this type of analysis (because a large number of features are numeric). The conversion of numeric data to categorical in the context of this dataset is an art rather than a science (meaning what bin sizes are good for categorizing the number of stores as low, medium or high is subjective), therefore we can get vastly different results by changing the criteria for converting numeric data to categorical (have 3 bins instead of 5, for example). The regression and classification algorithms discussed later are better suited for this data science problem.

To select which features could have a relationship with the dependent variables that we want to predict, we draw scatter plots of the dependent feature vs. all numeric features in the dataset and manually examine where a correlation is seen. Another technique used for finding the most important features is a machine learning technique (ExtraForestClassifier), from which the top two most important features are selected for conversion to categorical. The scatter plots are not included in this report although they are available as part of the overall package that has been submitted. Similarly, the list of important features can be seen in the logs generated on running the program.
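A minimal sketch of the feature-importance step is shown below, using scikit-learn's ExtraTreesClassifier as a stand-in for the classifier named above; the target column and the fill-with-zero handling of missing values are assumptions.

import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

def top_features(df, target_col='Num.Starbucks.Stores.Categorical', n=2):
    """Sketch: rank numeric features by their importance for predicting the categorical store count."""
    # In practice the store-count columns themselves would be excluded from the predictors.
    X = df.select_dtypes(include='number').fillna(0)
    y = df[target_col]
    model = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)
    importances = pd.Series(model.feature_importances_, index=X.columns)
    return importances.sort_values(ascending=False).head(n)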

Two different types of rule mining are done: one with rules containing a single variable predicting the dependent variable, and another with two variables predicting the dependent variable.

Association rules with 2 features

The following table identifies all the association rules with two variables:

Feature Description Categorical values
Num.Starbucks.Stores.Categorical Number of Starbucks stores, categorical VL, L, M, H, VH
ST.INT.ARVL.Categorical International tourist arrivals, categorical VL, L, M, H, VH

The expectation is that a very high influx of tourists should occur together with a very high number of Starbucks stores. This does occur, as this is the second most frequent rule shown in the table below; however, the support value for this rule is a low 0.125 and the confidence is 0.6. The rule that occurs most frequently is that a medium tourist influx corresponds to a medium number of Starbucks stores; this rule occurs 15 times and has a support of 0.20 and a confidence of 0.53. We expected a greater support level for the more frequent rules, but this is not happening, probably for two reasons: a) there are only 72 countries with Starbucks stores, so there is not a lot of data, and b) the conversion of numeric to categorical data in this case is not an exact science and can be tweaked to give better results (reduce the number of categories, for example).

This is all summarized in the following table. The rules themselves are written in the following format: R:({A})->B means that if A is present then B is also present (with the frequency, support and confidence listed alongside to give a sense of probability).
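As a quick illustration of how frequency, support and confidence relate, here is a small sketch; the antecedent count of 28 is not listed in the table but is implied by the quoted support and confidence of the most frequent rule.

def rule_metrics(freq_rule, freq_antecedent, n_records):
    """For a rule A -> B: support = P(A and B), confidence = P(B | A)."""
    support = freq_rule / n_records
    confidence = freq_rule / freq_antecedent
    return support, confidence

# The most frequent rule occurs 15 times over 72 countries, and its antecedent
# (medium tourist influx) occurs 28 times: support ~0.208, confidence ~0.536.
print(rule_metrics(15, 28, 72))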


In [29]:
import pandas as pd
ASSOC_RULES_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/association_rules/association_rules_2_features.csv'
df = pd.read_csv(ASSOC_RULES_URL)
df.sort_values(by='frequency', ascending=False).head()


Out[29]:
rule frequency support confidence
0 R:({SBS_M})->TOURIST_ARR_M 15 0.208333 0.535714
18 R:({TOURIST_ARR_M})->SBS_M 15 0.208333 0.535714
29 R:({TOURIST_ARR_VH})->SBS_VH 9 0.125000 0.600000
12 R:({SBS_VH})->TOURIST_ARR_VH 9 0.125000 0.600000
19 R:({TOURIST_ARR_M})->SBS_H 7 0.097222 0.250000
### Association rules with 2 features filtered by 3 different support levels The following tables present association rules filterd by 3 different support levels. 1. support >= 0.05 and confidence >= 0.25 2. support >= 0.07 and confidence >= 0.25 3. support >= 0.1 and confidence >= 0.25

In [30]:
import pandas as pd
ASSOC_RULES_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/association_rules/association_rules_2_features_min_support_0.05_and_min_confidence_0.25.csv'
df1 = pd.read_csv(ASSOC_RULES_URL)
df1


Out[30]:
rule frequency support confidence
0 R:({SBS_M})->TOURIST_ARR_M 15 0.208333 0.535714
1 R:({SBS_VL})->TOURIST_ARR_VL 5 0.069444 0.555556
2 R:({SBS_H})->TOURIST_ARR_M 7 0.097222 0.500000
3 R:({SBS_H})->TOURIST_ARR_VH 4 0.055556 0.285714
4 R:({SBS_VH})->TOURIST_ARR_VH 9 0.125000 0.600000
5 R:({SBS_VH})->TOURIST_ARR_H 4 0.055556 0.266667
6 R:({TOURIST_ARR_M})->SBS_M 15 0.208333 0.535714
7 R:({TOURIST_ARR_M})->SBS_H 7 0.097222 0.250000
8 R:({TOURIST_ARR_VL})->SBS_VL 5 0.069444 0.625000
9 R:({TOURIST_ARR_L})->SBS_M 4 0.055556 0.571429
10 R:({TOURIST_ARR_VH})->SBS_VH 9 0.125000 0.600000
11 R:({TOURIST_ARR_VH})->SBS_H 4 0.055556 0.266667
12 R:({TOURIST_ARR_H})->SBS_M 6 0.083333 0.428571
13 R:({TOURIST_ARR_H})->SBS_VH 4 0.055556 0.285714

In [31]:
ASSOC_RULES_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/association_rules/association_rules_2_features_min_support_0.07_and_min_confidence_0.25.csv'
df2 = pd.read_csv(ASSOC_RULES_URL)
df2


Out[31]:
rule frequency support confidence
0 R:({SBS_M})->TOURIST_ARR_M 15 0.208333 0.535714
1 R:({SBS_H})->TOURIST_ARR_M 7 0.097222 0.500000
2 R:({SBS_VH})->TOURIST_ARR_VH 9 0.125000 0.600000
3 R:({TOURIST_ARR_M})->SBS_M 15 0.208333 0.535714
4 R:({TOURIST_ARR_M})->SBS_H 7 0.097222 0.250000
5 R:({TOURIST_ARR_VH})->SBS_VH 9 0.125000 0.600000
6 R:({TOURIST_ARR_H})->SBS_M 6 0.083333 0.428571

In [32]:
ASSOC_RULES_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/association_rules/association_rules_2_features_min_support_0.1_and_min_confidence_0.25.csv'
df = pd.read_csv(ASSOC_RULES_URL)
df


Out[32]:
rule frequency support confidence
0 R:({SBS_M})->TOURIST_ARR_M 15 0.208333 0.535714
1 R:({SBS_VH})->TOURIST_ARR_VH 9 0.125000 0.600000
2 R:({TOURIST_ARR_M})->SBS_M 15 0.208333 0.535714
3 R:({TOURIST_ARR_VH})->SBS_VH 9 0.125000 0.600000

Association rules with 3 features

The following table describes the three features used in these association rules:

Feature Description Categorical values
Num.Starbucks.Stores.Categorical Number of Starbucks stores, categorical VL, L, M, H, VH
ST.INT.ARVL.Categorical International tourist arrivals, categorical VL, L, M, H, VH
SP.POP.TOTL.Categorical Total population, categorical VL, L, M, H, VH

The expectation is that a high influx of tourists and a high local population should occur together with a high number of Starbucks stores. This does occur: it is the second most frequent rule in the table below, although the support for this rule is a very low 0.08 while the confidence is 0.75. The rule that occurs most frequently is that a medium tourist influx in a country with a medium population corresponds to a medium number of Starbucks stores; this rule occurs 7 times and has a support of 0.09 and a confidence of 0.53. A greater support level was expected for the more frequent rules, but this is probably not happening for two reasons: a) there are only 72 countries with Starbucks stores, so there is not a lot of data, and b) the conversion of numeric to categorical data in this case is not an exact science and can be tweaked to give better results (for example, by reducing the number of categories).

This is all summarized in the following table. The rules are written in the format R:({A,B})->C, meaning that if A and B are present together then C is also present, with the frequency, support and confidence listed alongside to give a sense of probability.


In [33]:
import pandas as pd
ASSOC_RULES_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/association_rules/association_rules_3_features.csv'
df = pd.read_csv(ASSOC_RULES_URL)
df.sort_values(by='frequency', ascending=False).head()


Out[33]:
rule frequency support confidence
0 R:({POPT_M,TOURIST_ARR_M})->SBS_M 7 0.097222 0.538462
35 R:({POPT_VH,TOURIST_ARR_VH})->SBS_VH 6 0.083333 0.750000
9 R:({POPT_M,TOURIST_ARR_H})->SBS_M 4 0.055556 0.500000
20 R:({POPT_VL,TOURIST_ARR_VL})->SBS_VL 4 0.055556 0.800000
1 R:({POPT_M,TOURIST_ARR_M})->SBS_H 3 0.041667 0.230769

Association rules with 3 features filtered by 3 different support levels

The following tables present association rules filterd by 3 different support levels.

  1. support >= 0.05 and confidence >= 0.25
  2. support >= 0.07 and confidence >= 0.25
  3. support >= 0.1 and confidence >= 0.25

In [34]:
import pandas as pd
ASSOC_RULES_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/association_rules/association_rules_3_features_min_support_0.05_and_min_confidence_0.25.csv'
df1 = pd.read_csv(ASSOC_RULES_URL)
df1


Out[34]:
rule frequency support confidence
0 R:({POPT_M,TOURIST_ARR_M})->SBS_M 7 0.097222 0.538462
1 R:({POPT_M,TOURIST_ARR_H})->SBS_M 4 0.055556 0.500000
2 R:({POPT_VL,TOURIST_ARR_VL})->SBS_VL 4 0.055556 0.800000
3 R:({POPT_VH,TOURIST_ARR_VH})->SBS_VH 6 0.083333 0.750000

In [35]:
import pandas as pd
ASSOC_RULES_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/association_rules/association_rules_3_features_min_support_0.07_and_min_confidence_0.25.csv'
df1 = pd.read_csv(ASSOC_RULES_URL)
df1


Out[35]:
rule frequency support confidence
0 R:({POPT_M,TOURIST_ARR_M})->SBS_M 7 0.097222 0.538462
1 R:({POPT_VH,TOURIST_ARR_VH})->SBS_VH 6 0.083333 0.750000

In [36]:
import pandas as pd
ASSOC_RULES_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/association_rules/association_rules_3_features_min_support_0.1_and_min_confidence_0.25.csv'
df1 = pd.read_csv(ASSOC_RULES_URL)
df1


Out[36]:
rule frequency support confidence

Predictive Analysis

This section describes three hypotheses that have been developed using the combined WorldBank and Starbucks dataset and then tests each of them using parametric and/or predictive algorithms.

Hypothesis 1: There is no difference in the average number of Starbucks stores in countries with a very high number of international tourists vs. the rest of the world.

This hypothesis is tested in two ways.

  1. Two sample T-test: A two sample T-test is used to find out if the means of two independent samples are different. The null hypothesis is that the means of both samples are the same. The two sample T-test relies only on sample means and does not need to know the population mean.

  2. Linear and Polynomial regression: ordinary least squares regression. Linear regression tries to predict the value of the dependent variable using a linear combination of explanatory variables. Linear regression can be univariate (one explanatory variable) or multivariate. Polynomial regression tries to model the dependent variable using higher powers of the explanatory variable(s).

Using the two sample T-test for hypothesis 1

The combined dataset is filtered to obtain entries where 'ST.INT.ARVL.Categorical' equals 'VH' (very high) and another set of entries where 'ST.INT.ARVL.Categorical' has any other value, representing the rest of the world. The Python scipy module is used to calculate a T statistic and a corresponding p-value when comparing the two distributions. The p-value comes out to be ~0.17. This is an unusually large value, and with a typical T-test alpha of 0.05 it would not be possible to reject the null hypothesis. However, since the distribution of the number of stores was assumed to be normal when it is known not to be (see the histogram in previous sections), a higher alpha of 0.20 is chosen, which allows the null hypothesis of no difference to be rejected. This is in line with what is seen in the scatter plots for these two features.
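The following is a minimal sketch of this test using scipy, with made-up store counts standing in for the 'VH' group and the rest of the world (the project's actual numbers come from the combined dataset).

import numpy as np
from scipy import stats

# Hypothetical store counts: countries with 'VH' tourist arrivals vs. the rest.
stores_vh = np.array([380, 950, 1000, 1400, 2800, 13000])
stores_rest = np.array([5, 9, 12, 18, 30, 48, 85, 140, 260, 330])

t_stat, p_value = stats.ttest_ind(stores_vh, stores_rest)
print('T-statistic = %.3f, p-value = %.3f' % (t_stat, p_value))

alpha = 0.20  # relaxed alpha chosen because the counts are not really normal
print('reject H0' if p_value < alpha else 'fail to reject H0')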


In [37]:
import pandas as pd
T_TEST_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/regression/t_test_results.csv'
df1 = pd.read_csv(T_TEST_URL)
df1


Out[37]:
Hypothesis No difference between average number of Starbucks stores in countries with Very high and high number of international tourist Vs Rest of the world
0 T-statistic, p-value 1.452938,0.168245

Using Linear and Polynomial regression for hypothesis 1

In this method, instead of using the categorical value of the number of international tourist arrivals, the actual discrete value is used and provided to a linear regression model and a polynomial regression model to predict the number of Starbucks stores. The linear model provides an explained variance of 0.27, whereas the polynomial model performs much better, providing an explained variance of 0.33. Ideally the explained variance should be close to 1; it is a matter of further investigation as to what else can be done to make the prediction better. The following scatter plots show the linear and polynomial regression models in action; as can be seen from the plots, the polynomial model provides a better fit. The polynomial model is a degree 3 polynomial (other degrees were also tried and 3 provided the best results). The outlier value seen towards the top right is the number of stores in the U.S.; it is an extreme value and further investigation would be needed to determine whether a single variable can predict such a huge jump in value.
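A minimal sketch of the linear vs. degree-3 polynomial comparison is shown below, using scikit-learn and made-up (tourist arrivals, store count) pairs; the real analysis uses the combined dataset.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import explained_variance_score

X = np.array([[2e6], [5e6], [1e7], [2e7], [4e7], [8e7]])  # tourist arrivals
y = np.array([10, 40, 120, 300, 900, 1400])               # Starbucks stores

lin = LinearRegression().fit(X, y)
print('linear explained variance:',
      explained_variance_score(y, lin.predict(X)))

X_poly = PolynomialFeatures(degree=3).fit_transform(X)
poly = LinearRegression().fit(X_poly, y)
print('degree-3 polynomial explained variance:',
      explained_variance_score(y, poly.predict(X_poly)))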

Hypothesis 2: The number of Starbucks stores in a country as a categorical value can be predicted using some WDI indicators

This hypothesis states that the number of Starbucks stores in a country expressed as a categorical value (Very High, High, Medium, Low, Very Low) can be predicted using WDI (World Development Indicators) from the WorldBank data. This is verified by running various classification algorithms and then selecting the one that provides the highest accuracy value as measured via cross validation.

The combined dataset contains 35 WDI features. Not all of these are used for classification; the 15 most important features are chosen. The Extra Trees classifier, which uses an ensemble of decision trees, is used to provide a ranking for all the features, and the top ranked 15 features are then used as inputs for the classification algorithms. Tests were also done with 5, 10, 30 and all 35 features; the 15 most important features provided the best results in terms of accuracy.
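A minimal sketch of this feature-ranking step is shown below; the helper name top_k_features and the stand-in columns are assumptions, with X holding the WDI features and y the store-count categories in the real analysis.

import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

def top_k_features(X, y, k=15):
    """Rank features with an Extra Trees ensemble and return the top k names."""
    clf = ExtraTreesClassifier(n_estimators=100, random_state=0)
    clf.fit(X, y)
    ranked = pd.Series(clf.feature_importances_, index=X.columns)
    return ranked.sort_values(ascending=False).head(k).index.tolist()

# Example with random stand-in data (3 fake WDI columns, 2 classes).
X_demo = pd.DataFrame(np.random.rand(72, 3), columns=['WDI_A', 'WDI_B', 'WDI_C'])
y_demo = pd.Series(np.random.choice(['M', 'H'], size=72))
print(top_k_features(X_demo, y_demo, k=2))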

The 15 most important features for determining the number of stores value as a categorical are:

Feature Description
TX.VAL.TECH.CD High-technology exports (current US dollars)
ST.INT.ARVL International tourism, number of arrivals
SP.POP.TOTL Total Population
BG.GSR.NFSV.GD.ZS Trade in services (percentage of GDP)
SL.GDP.PCAP.EM.KD GDP per person employed (constant 2011 PPP dollars)
IC.LGL.CRED.XQ Strength of legal rights index (0=weak to 12=strong)
BX.KLT.DINV.WD.GD.ZS Foreign direct investment, net inflows (percentage of GDP)
IC.EXP.COST.CD Cost to export (US$ per container)
NE.CON.PETC.ZS Household final consumption expenditure, etc. (percentage of GDP)
IC.REG.DURS Time required to start a business (days)
TX.VAL.OTHR.ZS.WT Computer, communications and other services (percentage of commercial service exports)
IC.REG.PROC Start-up procedures to register a business (number)
IC.BUS.EASE.XQ Ease of doing business
SH.STA.ACSN.UR Improved sanitation facilities, urban (percentage of urban population with access)
IT.NET.USER.P2 Internet users (per 100 people)

The above list is very interesting because all of these indicators appear to be factors that a business would consider important. Many of them are either economic or directly related to business activity, so it intuitively makes sense that the machine learning algorithm is choosing the right indicators.

Once the indicators were selected, all of the following classification algorithms were applied to the data and the results in terms of accuracy and standard deviation of a K-fold cross validation were noted. Random forest turned out to be the best classifier in terms of accuracy.

The accuracy turned out to be in the 50 to 60 percent range, but this is just the first pass at prediction. The indicators, how they are used, and the bins for the classification of the dependent variable are all subject to refinement, so this is not the final score. The hypothesis is considered valid although the accuracy value is lower than desirable. More investigation needs to be performed to bring the accuracy score up to around the 80 percent range.

Description of classification algorithms used

All of the following algorithms were used for classification. All of these algorithms are available via the Scikit-learn Python package. A train/test split was done using a training data fraction of 20%. A 10-fold cross validation was done to determine the accuracy of the various algorithms (a sketch of this setup follows the table below).

Algorithm Description Important notes Results: Accuracy (standard deviation)
kNN K nearest neighbors is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions). Used scikit learn's defaults. 0.476667 (0.164688)
CART Classification and Regression Trees. Decision tree builds classification or regression models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets while at the same time an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision node has two or more branches. Leaf node represents a classification or decision. The top most node is called root and is also the best predictor. Used scikit learn's defaults. 0.383333 (0.188709)
Naive Bayes The Naive Bayesian classifier is based on Bayes’ theorem with independence assumptions between predictors. A Naive Bayesian model is easy to build, with no complicated iterative parameter estimation, which makes it particularly useful for very large datasets. Used scikit learn's defaults. Performs well along with kNN and RandomForest for the combined dataset. 0.493333 (0.221008)
SVM A Support Vector Machine (SVM) performs classification by finding the hyperplane that maximizes the margin between the two classes. The vectors (cases) that define the hyperplane are the support vectors. Used scikit learn's defaults (rbf kernel is the default, tried with other kernels all give the same result). Does not perform well for the dataset. 0.363333 (0.210000)
RandomForest A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default). Used scikit learn defaults. Provides the best results. 0.510000 (0.201136)
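A minimal sketch of how such a comparison might be set up with scikit-learn defaults and 10-fold cross validation is shown below; X and y are random stand-ins for the 15 selected WDI features and the store-count categories.

import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

models = {
    'kNN': KNeighborsClassifier(),
    'CART': DecisionTreeClassifier(),
    'NaiveBayes': GaussianNB(),
    'SVM': SVC(),
    'RandomForest': RandomForestClassifier(),
}

rng = np.random.default_rng(0)
X = rng.random((72, 15))                              # stand-in features
y = rng.choice(['VL', 'L', 'M', 'H', 'VH'], size=72)  # stand-in labels

for name, model in models.items():
    scores = cross_val_score(model, X, y,
                             cv=KFold(n_splits=10, shuffle=True, random_state=7))
    print('%s: %.3f (%.3f)' % (name, scores.mean(), scores.std()))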

Hypothesis 3: The number of Starbucks stores (discrete) in a country can be predicted using some WDI indicators.

This hypothesis states that the number of Starbucks stores in a country, which is a discrete variable (as opposed to the categorical value considered in hypothesis 2), can be predicted using some WDI indicators. The method used here is similar to hypothesis 2: first the three most important features are determined using the Extra Trees classifier, then these three features are fed into a multivariate polynomial regression model to predict the dependent variable, i.e. the number of Starbucks stores in a country. The three features used for this prediction are 'ST.INT.ARVL' (international tourist arrivals), 'TX.VAL.TECH.CD' (high-technology exports, current US dollars) and 'SP.POP.TOTL' (population of the country).

A multivariate polynomial model is chosen since there is more than one explanatory variable and the relationship with the dependent variable is definitely not linear, as is known from the scatter plots (not included here but available as part of the package). A degree 3 polynomial works best, as determined by experimenting with other values as well. An explained variance score of 0.99 is achieved; however, the mean squared error is still very high at 27063. The MSE needs to be improved before the hypothesis can be considered valid, and this needs further investigation in terms of feature selection and the model used for prediction.
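A minimal sketch of such a multivariate degree-3 polynomial regression is shown below; the three columns stand for ST.INT.ARVL, TX.VAL.TECH.CD and SP.POP.TOTL, and the values are made up for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import explained_variance_score, mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[8e7, 1.5e11, 3.2e8],
              [3e7, 8.0e10, 6.0e7],
              [1e7, 2.0e10, 5.0e7],
              [5e6, 5.0e09, 1.0e7],
              [2e6, 1.0e09, 5.0e6]])
y = np.array([13000, 1200, 300, 40, 5])  # number of Starbucks stores

model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(X, y)
pred = model.predict(X)
print('explained variance:', explained_variance_score(y, pred))
print('MSE:', mean_squared_error(y, pred))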

Other Hypotheses

A couple of other hypotheses were also tested as part of this phase.

  1. Whether the ownership type of Starbucks stores in a country should be mixed (Starbucks owned, franchisee owned or other) or just Starbucks owned. All classification models were tried and it was found that Random Forest and kNN both predict this with about 84% accuracy (stddev 12%).

Bringing it all together [Project1 + Project2]

At this point in the data science lifecycle a clean combined dataset exists and has been used to validate multiple hypotheses as described above. The focus of this phase was getting everything in place to run predictive algorithms; the next phase will focus on improving the accuracy of the predictions. The results of the predictions are included in the overall package (see the regression folder) but are not being discussed here at this time.

What does appear is that, more than a regression problem, this can be modeled as a clustering and classification problem. In other words, clustering algorithms (like KMeans as described above) can be used to see which clusters the countries with no Starbucks stores fall into; for example, if they appear in clusters corresponding to countries with a high number of Starbucks stores then that is a prediction. Similarly, the RandomForest or NaiveBayes classifier (which provided between 50 to 61% accuracy) could be used to predict the category (low, medium, high etc.) for the number of Starbucks stores.

Another aspect to review is whether the categories for the number of Starbucks stores need to be reduced from 5 (Very high, high, medium, low, very low) to just 3 (high, medium, low) to improve the accuracy of predictions. These are all things to be considered for the next phase of the project.

Predictions based on Classification

As discussed in the previous section, it appears that classification is a better option than regression for prediction in this data science problem. In other words, it is possible to predict the number of Starbucks stores in a country as a categorical variable (i.e. Very Low, Low, Medium, High, Very High) with a much higher degree of accuracy than predicting the exact number as a continuous variable. The RandomForest classifier discussed above provided an accuracy of between 50 to 61% (stddev 12 to 20%); it is used to predict the number of stores in countries which do not have a Starbucks presence yet but for which data is available for all the features needed for prediction (a sketch of this prediction step follows the table below). The features used for prediction are listed under Hypothesis 2 and are not repeated here. Since the original WorldBank data has a lot of missing data, predictions cannot be made for all countries where Starbucks does not have a presence yet. The following table lists the results of the prediction (VL: Very Low, L: Low, M: Medium, H: High, VH: Very High). The categories translate into the following numerical ranges:

Range Category
[1,3] Very Low
[4,6] Low
[7,41] Medium
[42, 128] High
[129 and above] Very High
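The actual predictions are loaded from the project's output CSV in the next cell; the sketch below shows how such predictions could be produced in principle, assuming a combined dataframe with the listed feature columns, a 'Num.Starbucks.Stores' count and the country identifiers (all names here are assumptions).

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def predict_for_new_countries(df, feature_cols):
    """Train on countries that already have stores, predict for those without."""
    has_stores = df['Num.Starbucks.Stores'] > 0
    train = df[has_stores].dropna(subset=feature_cols)
    # Prediction is only possible where all feature values are present.
    to_predict = df[~has_stores].dropna(subset=feature_cols)

    clf = RandomForestClassifier(random_state=0)
    clf.fit(train[feature_cols], train['Num.Starbucks.Stores.Categorical'])

    out = to_predict[['country_codes', 'names']].copy()
    out['Num.Starbucks.Stores.Categorical'] = clf.predict(to_predict[feature_cols])
    return out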

In [1]:
import pandas as pd
RESULTS_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/classification/Num.Starbucks.Stores.Categorical_predictions.csv'
df = pd.read_csv(RESULTS_URL)
df


Out[1]:
Num.Starbucks.Stores.Categorical country_codes names
0 H 1W World
1 H XC Euro area
2 H XF IBRD only
3 H Z7 Europe & Central Asia
4 H T7 Europe & Central Asia (IDA & IBRD countries)
5 H XJ Latin America & Caribbean (excluding high income)
6 H T2 Latin America & the Caribbean (IDA & IBRD coun...
7 H EC Ecuador
8 H XH IDA blend
9 H XT Upper middle income
10 L BF Burkina Faso
11 L UG Uganda
12 L PY Paraguay
13 L MN Mongolia
14 M LV Latvia
15 M DO Dominican Republic
16 M HK Hong Kong SAR, China
17 M DZ Algeria
18 M IT Italy
19 M EE Estonia
20 M AM Armenia
21 M LK Sri Lanka
22 M MU Mauritius
23 M HR Croatia
24 M MD Moldova
25 M GY Guyana
26 M GE Georgia
27 M BJ Benin
28 M AL Albania
29 M MK Macedonia, FYR
30 M ZQ Middle East & North Africa
31 M LT Lithuania
32 M ME Montenegro
33 M XU North America
34 M NP Nepal
35 M UA Ukraine
36 M B8 Central Europe and the Baltics
37 M RW Rwanda
38 M JM Jamaica
39 M SI Slovenia
40 VH XD High income
41 VH OE OECD members
42 VH EU European Union
43 VL HN Honduras
44 VL SL Sierra Leone
45 VL SR Suriname
46 VL IS Iceland
47 VL MZ Mozambique
48 VL TZ Tanzania
49 VL 7E Europe & Central Asia (excluding high income)
50 VL IL Israel
51 VL UY Uruguay
52 VL NI Nicaragua
53 VL BA Bosnia and Herzegovina

Discussion of the prediction results

The prediction throws up some interesting results. While some interpretation is provided here, it would be up to a subject matter expert (and, well, Starbucks) to say why these groupings make sense (or not). A point to note here (which has not been discussed earlier) is that the WorldBank data also includes data for groups of countries (such as OECD) combined as a single entity, so these groups also appear in the following classification.

Category Notes
Very High High income countries, OECD countries (Switzerland, Turkey, U.S., U.K.) and the EU show up in this category. This makes intuitive sense: high income, developed nations have a very high number of Starbucks stores. An interesting result here is the OECD countries group (U.S., U.K., Turkey, Switzerland), since a Very High or High number of Starbucks already exists in these countries, so this speaks to the accuracy of the model.
High Several groups show up in this category. The world overall is classified as a High Starbucks region.
Very Low This group consists of some African and South American countries. It is a matter of further study to say why these countries are clustered together. Israel and Iceland could be outliers in this list.
Medium This is the single largest group consisting of 29 countries from all over the world. Italy shows up in this list, whereas one would expect that there would already be Starbucks in Italy. This list can be combined with other aspects (see visualizations) such as tourist inflow to prioritize which countries should be higher on the list for Starbucks to enter. A few countries worth mentioning from this list are Sri Lanka, Mauritius, Jamaica and European countries like Italy, Albania, Lithuania, Montenegro, Slovenia etc.
Low This is a small list: a couple of African countries, one Asian and one South American country.

Time Series Analysis for number of Starbucks stores [Additional Analysis 1]

A piece of information available in the Starbucks dataset is the date on which a store was first seen; the dataset contains information from December 2013 to the current month, i.e. November 2016. The first seen field can be used to identify, say, a month on month increase in the number of stores, i.e. stores for which the first seen date falls in January 2014 can all be considered newly opened stores in January 2014 (this was also empirically verified by the fact that a new store in Clarksburg, Maryland that the author is personally aware of shows up with a first seen date of November 2016).

This information about the first seen date can be used to do a time series analysis to examine and predict the number of Starbucks stores in the next 12 months. To do this the following steps were followed:

  1. Extract a time series of interest i.e. total number of Starbucks stores across the world and plot it to visually examine how the number of stores was increasing over time.

  2. Examine the time series for stationarity, i.e. whether the mean of the series is changing with time. The idea behind requiring the time series to be stationary is that if it is stationary then individual points can be considered I.I.D (independent identically distributed) and the usual concepts for stationary random variables (Central Limit Theorem, Law of Large Numbers) can be applied. If the series is not stationary then it needs to be transformed into a stationary series. There are several transformations to stationarize a series, such as the logarithm, deflation by CPI, the first difference etc.

    • From visual examination of the number of Starbucks stores time series it is clear that the series is not stationary (number of stores is increasing with time). A Dickey-Fuller test could also have been done to check for stationarity but it has not been done as part of this project. The logarithmic transformation was used to stationarize the time series.
    • Seasonal decomposition of the Starbucks stores time series is also done to extract out trend and seasonal variations. Plots are presented below.
  3. The Auto Regressive Integrated Moving Average (ARIMA) model is used to forecast the timeseries. In general, ARIMA models are applied in some cases where data show evidence of non-stationarity (like in case of the Starbucks store data), where an initial differencing step (corresponding to the "integrated" part of the model) can be applied to reduce the non-stationarity. See https://www.otexts.org/fpp/8/1 for more details about time series and https://people.duke.edu/~rnau/411arim.htm for an explanation on ARIMA.

    • The parameters (p,d,q) of the non-seasonal ARIMA model are non-negative integers: p is the order (number of time lags) of the autoregressive model, d is the degree of differencing (the number of times the data have had past values subtracted), and q is the order of the moving-average model. The (p,d,q) values are generally identified by studying the ACF/PACF values; in this case they were identified by trial and error (the values that gave the least mean squared error were used).

    • A Seasonal ARIMA (SARIMA) could have been tried but current stable release of the statsmodels package (0.6.1) does not support it, although it is available in a version currently under development (0.7.1).

  4. The ARIMA model is given the logged time series for the reasons discussed above. The fitted values provided by the model are actually the differences of the logged series; these values are cumulatively summed to derive the fitted logged series. Once the predicted logged series is obtained it is transformed back to its original scale by taking the exponent (e^x) of the logged values. The forecasted values for future timestamps are obtained using the same procedure (see the sketch after this list).
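A minimal sketch of this log/difference/ARIMA/back-transform procedure is shown below; the monthly series is a stand-in, the (p, q) order is a placeholder (the project's values were found by trial and error), and the ARIMA import path shown requires statsmodels >= 0.12.

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA  # statsmodels >= 0.12

# Stand-in monthly series of total stores (the project covers Dec 2013 - Nov 2016).
idx = pd.date_range('2013-12-01', periods=36, freq='MS')
stores = pd.Series(21000 + 110 * np.arange(36) + np.random.randn(36) * 30, index=idx)

log_stores = np.log(stores)
log_diff = log_stores.diff().dropna()         # first difference of the logged series

fit = ARIMA(log_diff, order=(2, 0, 1)).fit()  # (p, q) are placeholders here

# Back-transform: cumulative sum of fitted differences + first logged value, then exp.
fitted_stores = np.exp(fit.fittedvalues.cumsum() + log_stores.iloc[0])

# 12-month-ahead forecast, back-transformed the same way from the last observation.
forecast_stores = np.exp(fit.forecast(steps=12).cumsum() + log_stores.iloc[-1])
print(forecast_stores.round().astype(int))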

The following sections discuss time series analysis for two time series: one for the total number of stores across the world and another for the number of stores in the state of New York. Several other TSA analyses (such as time series for the number of stores in the U.K., India and the continent of Africa) are included in the project submission.

TSA for Starbucks Stores across the World

There are four plots provided below.

  1. Seasonal decomposition: clearly shows that there is an almost linear upwards trend as well as a seasonal element (a frequency value of 5 was used).
  2. First difference of the logged series along with the fitted values from the model.
  3. Original series with the fitted values (transformed back from the logged differences provided by the model). Also shown here is the root mean squared error (RMSE) to determine how good the prediction is. It is seen that the RMSE is ~152 which is roughly 0.60% of the total number of stores (25090), so the fitted value is off by less than 1% of the original value.
  4. Original series with fitted value and projected values. The projected values are for the next 12 month period i.e. from January 2017 to December 2017. The projected number of Starbucks stores in the World at the end of 2017 should be around 27995.

TSA for Starbucks Stores in the state of New York

The same four plots as above are provided for the time series analysis of the number of stores in the state of New York. The RMSE is again less than 1%. The forecasted number of stores in the state of New York at the end of 2017 is ~692, which would be an addition of 53 stores to the current number of 639.

Exploring distribution of Starbucks stores across the world [Additional Analysis 2]

This analysis explores the distribution of the number of Starbucks stores in countries which have a very high density of Starbucks stores (more than 1 Starbucks store per 100,000 people) and at least 100 Starbucks stores across the country. The number of stores per city in the country is modeled as a random variable and its distribution is examined. The following table provides the list of countries that have been studied; note that Singapore, which has 125 stores across the country, is not included in this list since there are only two cities in Singapore, one of which has a single Starbucks store.


In [3]:
import pandas as pd
RESULTS_URL = 'https://raw.githubusercontent.com/aarora79/sb_study/master/output/EDA/more/cities_withmost_stores.csv'
df = pd.read_csv(RESULTS_URL)
df


Out[3]:
country city_with_most_starbucks_stores count_in_city_with_most_stores
0 United States New York 231
1 United Kingdom London 195
2 UAE Dubai 75
3 Kuwait مدينة الكويت 74
4 Republic of Korea Seoul 234
5 Canada Toronto 191

From the above table it is clear that the maximum number of stores occurs in the capital city of the country, except in the case of the U.S. (New York, which is not a surprise) and Canada (Toronto); the Kuwaiti city shown in Arabic, مدينة الكويت, is Kuwait City.

The following boxplots and histograms show the distribution of Starbucks stores across cities in the above mentioned countries. As is clear from these plots, the distribution is right skewed with a very large difference between the third quartile and the maximum (because the one city with the maximum number of stores is an outlier).

In order to obtain any insight from the above plots the outliers need to be removed. This is done by "Winsorizing" (see https://en.wikipedia.org/wiki/Winsorizing) the distribution at the 0.05 level, i.e. all values greater than the 95th percentile are clamped at the 95th percentile value. The following are the plots after Winsorization. As seen from the plots, the basic nature of the distributions has not changed after Winsorization, but it is now clear that the median number of stores per city for all of these countries is around 3 and the 3rd quartile is less than 7. The shape of the distribution as seen from the histograms is similar for all of these countries, with the U.S. having the maximum number of stores in almost all percentiles. The shape of the distribution is discussed in the next section.
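A minimal sketch of the Winsorization step is shown below, using scipy's winsorize on made-up per-city counts that include one large outlier.

import numpy as np
from scipy.stats.mstats import winsorize

# Hypothetical per-city store counts for one country, with one outlier city.
stores_per_city = np.array([1] * 20 + [2] * 15 + [3] * 10 + [5] * 5 + [8, 12, 20, 231])
clamped = winsorize(stores_per_city, limits=[0, 0.05])  # clamp only the top 5%
print('max before:', stores_per_city.max(), 'max after:', np.asarray(clamped).max())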

Looking at the histogram above, this distribution looks similar to an exponential distribution, or at least something from that family of distributions. To understand this better, the distribution for each of these countries was examined individually and it was found by visual inspection that an Exponential-Logarithmic distribution or an Erlang distribution fitted them best. The Seaborn package used for making plots also provides the option to define a "fit" function and then overlays the fitted curve on the actual data.
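A minimal sketch of overlaying a fitted Erlang curve on a per-city store-count histogram is shown below; the data are randomly generated and the distplot interface used here is the older Seaborn API (newer releases replace it with histplot).

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

# Random stand-in for per-city store counts.
stores_per_city = np.random.gamma(shape=2.0, scale=2.0, size=200) + 1

sns.distplot(stores_per_city, kde=False, fit=stats.erlang)  # overlay fitted Erlang pdf
plt.xlabel('Starbucks stores per city')
plt.show()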

The following plots show the CDF (cumulative distribution function) plots as well as the histograms with the fitted curve from the Erlang distribution. The plots provided here are for the U.S. and the U.K. (the rest of the plots are included in the submission package). The Erlang distribution is used to model incoming calls in a telephone system, waiting times in queuing systems, interpurchase times in business economics, etc. In the context of this data science problem, this distribution could be applied to model the number of Starbucks stores across cities of a country where Starbucks does not exist today. The distribution could define the number of stores in the capital city (which would be an outlier) and the number of stores in other cities of the country, which would fall within the 1st quartile, 3rd quartile and so on. This could be used in conjunction with the previous analysis where the number of Starbucks stores is predicted as a categorical variable: from there a range for the total number of stores in the country is known, and this range could then be modeled using the Erlang distribution to distribute the total number of stores across multiple cities. This is a topic for further research.

Visualizations

This section provides several visualizations done as part of this project to better understand the data at hand. These visualizations are discussed individually in the subsequent sections.

Starbucks around the world [Visualization 1, Bokeh]

The following visualization shows all Starbucks stores as of November 2016, overlaid on a world map. The brown dots on the map represent the locations of Starbucks stores. Hovering over an individual point brings up popup text with the name of the location where the store exists along with the name of the city. Since the map is displayed at a certain zoom level, the points representing Starbucks stores appear very close to each other in high density areas.

The visualization clearly shows that a lot of Starbucks stores are located on the eastern and western coasts of the US, in the UK in Europe, in China and Japan in Asia, and in Malaysia and Indonesia in south east Asia. There are relatively few stores in South America and extremely few stores in Africa.

This visualization has been created using GMapPlot and Bokeh.
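A minimal sketch of such a plot is shown below; it assumes a dataframe with 'latitude', 'longitude', 'name' and 'city' columns and a valid Google Maps API key, and the exact Bokeh calls may vary with the Bokeh version.

import pandas as pd
from bokeh.io import output_file, show
from bokeh.models import ColumnDataSource, GMapOptions, HoverTool
from bokeh.plotting import gmap

# Stand-in for the Starbucks locations data.
df = pd.DataFrame({
    'latitude': [40.7588, 51.5145], 'longitude': [-73.9851, -0.1270],
    'name': ['Times Square', 'Covent Garden'], 'city': ['New York', 'London']})

map_options = GMapOptions(lat=20, lng=0, map_type='roadmap', zoom=2)
p = gmap('YOUR_GOOGLE_API_KEY', map_options, title='Starbucks stores')
p.circle(x='longitude', y='latitude', size=4, fill_color='brown',
         fill_alpha=0.8, source=ColumnDataSource(df))
p.add_tools(HoverTool(tooltips=[('name', '@name'), ('city', '@city')]))

output_file('locations.html')
show(p)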


In [15]:
from IPython.display import display, HTML
filename = 'locations.html'
# Pass the file via the filename keyword so its contents are rendered;
# a bare string would be treated as literal HTML markup.
display(HTML(filename=filename))


Bokeh Plot

Starbucks in the U.S. [Visualization 2, SVG]

With more than 13,000 Starbucks stores, the U.S. has the most Starbucks stores in the world by a huge margin, but how are those stores distributed within the US? This is what the next visualization explores by creating a heatmap of U.S. counties: the more stores, the more intense the color on the map. Mousing over a county brings up the name of the county and the number of stores in that county. The stores are, as expected, concentrated on the east and west coasts, with very few in the states at the geographical center of the country such as Nebraska, South Dakota and Kansas.

The maximum number of stores, 665, occurs in Los Angeles county in California. New York county (234 stores) in New York state, Maricopa county (334 stores) in Arizona, Cook county (325 stores) in Illinois and King county (336 stores) in Washington are some of the counties with a very high number of Starbucks stores. Just from visual examination of the graphic it appears that New York county (which includes Manhattan) would probably have the highest density of Starbucks stores in terms of the number of stores divided by the size of the county.

This plot has been created using a combination of an SVG (Scalable Vector Graphics) image of the US map with counties marked on it and the FCC (Federal Communications Commission) API that provides the name of the county given a location's latitude and longitude. All Starbucks locations are mapped to a county name using the FCC API in an offline process and a count is maintained of the number of stores in each county (see the sketch after the color table below). Once the store to county mapping is complete the result is stored in a CSV file. This CSV file is then used to assign color and hover text to the counties listed in the SVG file. The counties are colored using shades of a single color palette, with 0 stores (no Starbucks stores in the county) being the lightest and more than 300 stores being the darkest.

Number of stores Color Intensity (increases with number of stores)
0 Lightest
1
[2, 15)
[16,150)
[151, 300)
(301 and greater) Darkest
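The sketch below illustrates the offline store-to-county mapping and counting step; the county_for() helper is a hypothetical placeholder for the FCC API lookup (the real endpoint and response format should be taken from the FCC documentation), and the output filename is an assumption.

import pandas as pd

def county_for(lat, lon):
    """Hypothetical placeholder: return the county name for a lat/lon pair."""
    return 'Los Angeles'  # a real implementation would call the FCC API here

stores = pd.DataFrame({'latitude': [34.05, 34.10, 40.76],
                       'longitude': [-118.24, -118.33, -73.98]})
stores['county'] = [county_for(lat, lon)
                    for lat, lon in zip(stores['latitude'], stores['longitude'])]

counts = stores.groupby('county').size().rename('num_stores').reset_index()
counts.to_csv('stores_per_county.csv', index=False)  # feeds the SVG coloring step
print(counts)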

In [13]:
import IPython
fname = 'us_counties_starbucks_stores.html'
iframe = '<iframe src=' + fname + ' width=1000 height=550 scrolling=no frameBorder=0></iframe>'
IPython.display.HTML(iframe)


Out[13]:

Starbucks stores | Internet Users Vs International Tourists Arrivals [Visualization 3, Plotly]

As discussed in previous sections, the number of international tourist arrivals per year and internet users (per 100 people) are two important features which have been seen to influence the number of Starbucks stores in a country. These features were determined to be two of the more important features by the ExtraTreesClassifier and even intuitively it does make sense that countries with a large number of tourist arrivals and high Internet accessibility should have a higher number of Starbucks stores.

This relationship between the number of Starbucks stores, Internet users and international tourists is depicted using a bubble chart. The bubble chart represents 4-dimensional data as described below.

  1. The X axis represents the number of international tourist arrivals per year.
  2. The Y axis represents the number of Internet users per 100 people.
  3. Each country is represented by a bubble, with the diameter of the bubble representing the number of Starbucks stores in the country. The higher the number of stores, the larger the bubble.
  4. The color of the bubble represents the continent to which the country belongs.

The bubble chart presented below is created using Plotly; mouse over the bubbles to see information about the country, the number of Starbucks stores, etc. As with all Plotly charts, basic user interactions like selecting specific areas of the chart, saving as PNG etc. are available. All the data required for the bubble chart is available as part of the combined World Bank Starbucks dataset.


In [9]:
import IPython
url = 'https://plot.ly/~aa1603/2.embed?link=false'
iframe = '<iframe width=\"1000\" height=\"500\" frameborder=\"0\" scrolling=\"no\" src=\"%s\"></iframe>' %(url)
IPython.display.HTML(iframe)


Out[9]:

Some important observations from the above visualization are as follows:

  1. Almost all the bubbles which are medium to large in size occur to the right of the 10 million mark on the x-axis, which indicates that unless a country has north of 10 million international tourist arrivals a year it would have a very small number of Starbucks stores, probably around 20 or so, with the only exceptions being Brazil, Argentina, Philippines and Indonesia.

  2. The green colored bubbles which correspond to Asia are very visible (most of them are at least medium size, with China being the largest). This seems to indicate that countries in Asia generally tend to have a higher number of Starbucks stores than countries in other continents. This is also indicated in the boxplot drawn in a previous section, where the mean number of Starbucks stores was the highest for Asia.

  3. The largest bubble towards the top right side of the chart is of course the U.S., with high availability of Internet access and a very large number of tourists.

  4. France, which is the blue colored bubble in the top right corner, is relatively much smaller than the bubbles representing other countries with high availability of Internet access and a large influx of tourists. This looks like an aberration: France has only 124 Starbucks stores, yet with more than 80 million tourists per year and more than 80 out of 100 people having access to the Internet, France should have a lot more Starbucks stores than just 124.

Increase in Starbucks stores worldwide [Visualizations 4 & 5, Plotly]

The following bar chart shows the increase in Starbucks stores by country from December 2013 to November 2016. Mouse over individual bars to see the name of the country and the number of Starbucks stores added in that country between December 2013 and November 2016. As with all Plotly charts, basic user interactions like selecting specific areas of the chart, saving as PNG etc. are available.


In [12]:
import IPython
url = 'https://plot.ly/~aa1603/6.embed?link=false'
iframe = '<iframe width=\"1000\" height=\"500\" frameborder=\"0\" scrolling=\"no\" src=\"%s\"></iframe>' %(url)
IPython.display.HTML(iframe)


Out[12]:

Some observations from the above visualization:

  1. The maximum increase in terms of absolute number of stores happened in the U.S. followed by China, South Korea, Canada and Great Britain.
  2. Most of the growth is concentrated in the top 10 countries, in fact the increase in the number of stores in U.S. and China put together is way more than the increase in the number of stores in the rest of the world put together (almost 1.35 times more).

While the increase in absolute terms is useful, it is also interesting to find out which countries have the highest rate of increase, i.e. where Starbucks is expanding fastest. This list may not be the same as the list of countries with the greatest growth in absolute numbers. To determine this, the following procedure is followed (a sketch of this computation appears after the list):

  1. A month on month increase in number of Starbucks store is determined for each country.
  2. It is then converted to a percentage of the next month's value.
  3. This percentage month on month increase is then totaled up for the entire duration from December 2013 to November 2016.
  4. The countries with the highest cumulative percentage increase are determined.
    • Note that countries with less than 25 stores as of November 2016 are excluded from this analysis (anything less than 25 stores is considered too small to be significant).
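A minimal sketch of this cumulative percentage-increase computation for a single country is shown below; the monthly values are made up for illustration.

import pandas as pd

idx = pd.date_range('2013-12-01', periods=6, freq='MS')
stores = pd.Series([29, 31, 34, 36, 40, 44], index=idx)  # stand-in monthly counts

monthly_increase = stores.diff()                 # step 1: month on month increase
pct_of_month = monthly_increase / stores * 100   # step 2: as a % of that month's value
cumulative_pct = pct_of_month.sum()              # step 3: total over the period
print(round(cumulative_pct, 1))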

The time series of the number of stores for the 15 countries with the highest cumulative percentage increase are represented below as Sparkline charts (more on Sparklines: https://en.wikipedia.org/wiki/Sparkline). Note that this list does not include the U.S., so even though the U.S. had the highest increase in absolute numbers, the fastest rate of increase is elsewhere. It is seen from this visualization (and confirmed from the actual numbers, not shown here) that the maximum cumulative percentage increase happens in India, where the number of stores increased from 29 in December 2013 to 85 in November 2016.


In [11]:
import IPython
url = 'https://plot.ly/~aa1603/74.embed?link=false'
iframe = '<iframe width=\"800\" height=\"1000\" frameborder=\"0\" scrolling=\"no\" src=\"%s\"></iframe>' %(url)
IPython.display.HTML(iframe)


Out[11]:

Limitations and future work

The following limitations and scope for future study are identified:

  1. Many WDI features had to be discarded because there just wasn't sufficient data available; it is possible that some of these missing features might have been important from a prediction perspective. As and when data for these features becomes available, those features can also be included in the analysis.
    • Predictions could not be provided for all countries where Starbucks does not have a presence because data for the features used for prediction was not available for these countries; once this data becomes available these predictions can also be provided.
  2. Can the accuracy of predictions be increased by reducing the number of categories to 3 (Low, Medium, High) instead of the current 5? This will create a tradeoff between having the categories defined over much narrower intervals (and hence being more effective) and the accuracy with which these categories can be predicted.
  3. The Python statsmodels module will provide SARIMA (Seasonal ARIMA) at some point (hopefully soon); the TSA analysis would then be done with that model as well.
    • The parameters of the SARIMA model would be determined by observing the ACF/PACF values, something which has not been done in the current analysis.
  4. The distribution of Starbucks stores across geographies should also be examined by taking a log of the count of Starbucks stores in a city and then creating a histogram, which would transform the distribution into something more manageable. See the plot below. As with all Plotly charts, basic user interactions like selecting specific areas of the chart, saving as png etc. are available.

In [10]:
import IPython
url = 'https://plot.ly/~aa1603/76.embed?link=false'
iframe = '<iframe width=\"800\" height=\"500\" frameborder=\"0\" scrolling=\"no\" src=\"%s\"></iframe>' %(url)
IPython.display.HTML(iframe)


Out[10]:
